21:44:18 https://github.com/tevador/RandomX/pull/274
21:44:40 I'm getting about +5% with Zen3
21:48:31 Interestingly, I can squeeze up to 64 AES rounds at the end of the loop before hashrate starts decreasing. The current draft only has 16 rounds. TBD if the impact on soft AES systems is acceptable.
21:52:12 64 rounds when running single threads, or all threads on all cores?
21:56:42 1 thread
21:57:24 Need to test 2 threads running on the same core - this will slow down AES rounds, but L3 latency will stay the same
21:57:31 I don't have enough L3 to test with all threads.
21:57:32 I don't think you can squeeze more than 32 rounds
21:58:37 I think 16 rounds are find, this already doubles the amount of AES per hash.
21:58:43 fine*
21:58:44 yes
21:59:19 and it is in line with how we already use AES (4 rounds for program buffer, for example)
22:00:32 you can set thread affinity and run 2 threads on the same core to test how many rounds is possible
22:02:36 the code is available, anyone can try it
22:12:19 16 -> 32 rounds, -0.7% hashrate (2 threads on one core)
22:12:44 32 -> 64 rounds, -3.25% (2 threads on one core)
22:12:55 so even 32 rounds is noticeable
22:12:58 and xor -> 16 rounds?
22:15:46 less than 1 h/s difference
22:16:09 so < 0.1%
22:16:31 command line was "--mine --jit --largePages --threads 2 --affinity 3 --init 16"
22:16:33 so 16 rounds seems to be a good choice
22:16:38 yes
22:17:14 0.7% drop with 32 rounds is a lot. I consider 0.1% speedup to be significant these days, when I optimize XMRig
22:17:46 but I'm testing on Ryzen 7 1700 (Zen 1)
22:17:59 this CPU has the fastest L3
22:18:10 Zen 3 / Zen 4 will have smaller drop on 32 rounds
22:18:34 still, 16 rounds should be optimal
22:19:20 because it will be unnoticeable on all CPUs
22:19:29 *with hardware AES
22:20:01 btw, to measure the CFROUND effect, you can change line 125 in common.hpp to using JitCompiler = JitCompilerX86; and rebuild
22:20:45 I measured about 5% with 1 thread on Ryzen 5850U
22:22:42 926.4 h/s -> 1004.2 h/s on Ryzen 7 1700 (2 threads on 1 core)
22:22:52 so 8% speedup
22:23:15 and it was 1004.2 -> 1003.8 h/s when I changed xor to 16 aes rounds
22:23:20 pretty significant
22:24:50 yes, all Zen CPUs implement mxcsr instructions in microcode
22:24:59 Intel is much faster with it
22:26:15 but this change shouldn't make Intel CPUs worse, it will just give smaller boost (0-1%)
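
For anyone who wants to re-check the percentages quoted in the log, here is a minimal Python sketch of the arithmetic. The two hashrate pairs are copied from the messages above (Ryzen 7 1700, 2 threads on 1 core). The helper name `percent_change` and the labels are illustrative only and are not part of RandomX or XMRig.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Hashrates in h/s as quoted in the log; the labels are illustrative.
measurements = [
    ("CFROUND effect", 926.4, 1004.2),
    ("xor -> 16 AES rounds", 1004.2, 1003.8),
]

for label, before, after in measurements:
    change = percent_change(before, after)
    print(f"{label}: {before} -> {after} h/s ({change:+.2f}%)")
```

The first pair works out to a gain of a little over 8%, matching the "8% speedup" message, and the second to a drop well under 0.1%, matching the "< 0.1%" estimate.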