#monero-pow

21:44

tevador

tevador/RandomX #274
21:44

tevador

I'm getting about +5% with Zen3
21:48

tevador

Interestingly, I can squeeze up to 64 AES rounds at the end of the loop before hashrate starts decreasing. The current draft only has 16 rounds. TBD if the impact on soft AES systems is acceptable.
21:52

sech1

64 rounds when running a single thread, or all threads on all cores?
21:56

tevador

1 thread
21:57

sech1

Need to test 2 threads running on the same core - this will slow down AES rounds, but L3 latency will stay the same
21:57

tevador

I don't have enough L3 to test with all threads.
21:57

sech1

I don't think you can squeeze more than 32 rounds
21:58

tevador

I think 16 rounds are fine, this already doubles the amount of AES per hash.
21:58

sech1

yes
21:59

sech1

and it is in line with how we already use AES (4 rounds for program buffer, for example)
22:00

sech1

you can set thread affinity and run 2 threads on the same core to test how many rounds are possible
22:02

tevador

the code is available, anyone can try it
22:12

sech1

16 -> 32 rounds, -0.7% hashrate (2 threads on one core)
22:12

sech1

32 -> 64 rounds, -3.25% (2 threads on one core)
22:12

sech1

so even 32 rounds is noticeable
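
As a side note, the two measurements above compound multiplicatively, so going straight from 16 to 64 rounds would cost a little under 4% on this setup. A rough Python sketch of that arithmetic follows. It assumes the two steps can simply be chained, and the helper name "chain" is made up for illustration; it is not part of RandomX or XMRig.

```python
def chain(*changes_pct: float) -> float:
    """Combine successive percentage changes into one overall change (in percent)."""
    factor = 1.0
    for change in changes_pct:
        factor *= 1.0 + change / 100.0
    return (factor - 1.0) * 100.0


# Drops reported above for 2 threads on one core:
# 16 -> 32 rounds: -0.7%, 32 -> 64 rounds: -3.25%
total = chain(-0.7, -3.25)
print(f"16 -> 64 rounds: {total:+.2f}%")  # 16 -> 64 rounds: -3.93%
```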
22:12

tevador

and xor -> 16 rounds?
22:15

sech1

less than 1 h/s difference
22:16

sech1

so < 0.1%
22:16

sech1

command line was "--mine --jit --largePages --threads 2 --affinity 3 --init 16"
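
In case the "--affinity 3" part is unclear: assuming the benchmark treats the value as a bitmask of logical CPUs (bit i set means CPU i may be used), 3 is binary 11, i.e. logical CPUs 0 and 1. Whether those two are SMT siblings of one physical core depends on how the OS numbers them, so it is worth checking the topology first. A small Python sketch of the decoding ("cpus_from_mask" is a hypothetical helper, not something from the RandomX code):

```python
def cpus_from_mask(mask: int) -> list[int]:
    """Return the logical CPU indices whose bits are set in `mask`."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


print(cpus_from_mask(3))       # [0, 1]
print(cpus_from_mask(0b1010))  # [1, 3]
```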
22:16

tevador

so 16 rounds seems to be a good choice
22:16

sech1

yes
22:17

sech1

A 0.7% drop with 32 rounds is a lot. I consider a 0.1% speedup to be significant these days, when I optimize XMRig
22:17

sech1

but I'm testing on Ryzen 7 1700 (Zen 1)
22:17

sech1

this CPU has the fastest L3
22:18

sech1

Zen 3 / Zen 4 will have a smaller drop with 32 rounds
22:18

sech1

still, 16 rounds should be optimal
22:19

sech1

because it will be unnoticeable on all CPUs
22:19

sech1

*with hardware AES
22:20

tevador

btw, to measure the CFROUND effect, you can change line 125 in common.hpp to "using JitCompiler = JitCompilerX86<RANDOMX_FLAG_DEFAULT>;" and rebuild
22:20

tevador

I measured about 5% with 1 thread on Ryzen 5850U
22:22

sech1

926.4 h/s -> 1004.2 h/s on Ryzen 7 1700 (2 threads on 1 core)
22:22

sech1

so 8% speedup
22:23

sech1

and it was 1004.2 -> 1003.8 h/s when I changed xor to 16 aes rounds
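
For reference, the percentages quoted above follow directly from the measured hashrates. A minimal Python sketch of the calculation (the helper name "pct_change" is only illustrative, not part of RandomX or XMRig):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after / before - 1.0) * 100.0


# Hashrates reported above for Ryzen 7 1700, 2 threads on 1 core.
cfround_gain = pct_change(926.4, 1004.2)
aes_cost = pct_change(1004.2, 1003.8)

print(f"CFROUND change:       {cfround_gain:+.2f}%")  # +8.40%
print(f"xor -> 16 AES rounds: {aes_cost:+.2f}%")      # -0.04%
```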
22:23

tevador

pretty significant
22:24

sech1

yes, all Zen CPUs implement mxcsr instructions in microcode
22:24

sech1

Intel is much faster with it
22:26

sech1

but this change shouldn't make Intel CPUs worse, it will just give a smaller boost (0-1%)
