18:54:04 I have been experimenting with making the ma, mx prefetch registers go from 2 to 3/4 slots (effectively converting them into a ring buffer). What this means is that we can prefetch several VM iterations ahead instead of just the next one. For v2 (program size 384, ring size 2->3), looking at pipeline performance counters on Zen5,
18:54:06 it goes from 27% stalled due to being memory bound down to 19%, while increasing the hashrate (on my test setup) from 10 kH/s to 12 kH/s. With ring size 4, that's 13 kH/s. So it effectively does way more work, uses more memory bandwidth, and stalls less
18:54:37 Here's a table of some of the tests and hashrates across Zen5/Zen3 and a random i7-7700K I've got around
18:54:39 https://paste.debian.net/hidden/1677baa9
18:55:09 talking with sech1 suggests doing program size 384 (as is currently in v2) and ringSize(n)=3
18:55:46 the hashrates are a bit noisy, specifically for the i7 (no huge pages there), but the perf stats tend to be on point for each run when measured across different runs
18:57:35 effectively, if you have A, B, C with N=3: at the end of the loop you write the dataset prefetch at C, and then read from the location at A; the next iteration it'd be A, B; then B, C; then back to C, A (a ring)
19:22:36 oh, perf counters on zen5 as well:
19:22:38 size 384, n2 https://paste.debian.net/hidden/358363f8 10 kH/s
19:22:40 size 384, n3 https://paste.debian.net/hidden/0cd2347c 12 kH/s
19:23:11 this is using perf stat --metrics PipelineL1,PipelineL2
19:23:13 equivalent to the counters listed on https://docs.amd.com/r/en-US/57368-uProf-user-guide/Pipeline-Utilization
19:24:11 the % in parentheses is just what fraction of time was spent measuring that metric, not the actual %
19:43:28 TL;DR: n=3 effectively makes the v2 hashrate higher than v1's, while still doing more than 1.5x the work
19:45:56 The other side of the coin is that it allows calculating two superscalar hashes at the same time, but it should be fine because they will still burn the same amount of
energy per hash
19:48:27 This will speed up the light mode though...
19:49:06 Light mode is about burning more energy, not hashrate per se, right?
19:57:25 less stalling is good, keep the memory busy
19:57:55 n=4 was giving an extra +1 kH/s on Zen5
19:58:10 I tested n=8 too, but that was just another 800 H/s over n=4
19:59:27 (for context, a 9900X3D with 2x CP64G56C46U5.M16B1 running in low power mode)
20:00:28 previous benchmark with just large pages on v1: https://xmrig.com/benchmark/361m7w
20:03:12 Zen3 gained barely anything going from n=2 to n=3, and the i7 is mostly the same across all of them
20:04:37 Zen 3 is probably memory bound, but on the RAM side, not the CPU side
20:04:55 They just can't give more hashrate
20:05:10 yeah, it's also a cursed suboptimal setup with only 2x LRDIMM out of 8x populated
20:05:12 that'd be something for any future tweaks, anyhow.
20:05:13 I mean your specific Zen 3 build
20:05:22 I don't doubt it :)
20:06:30 a 24-core Zen 6 will definitely hit 40+ kH/s with this tweak, and will get RAM limited too...
20:07:01 Or more like it will be right on the edge of being RAM bound
20:07:09 I have the testing changes that support setting what I called ProgramPrefetchRingSize on my go-randomx auto-test branch https://git.gammaspectra.live/P2Pool/go-randomx/src/branch/auto-test
20:07:14 Assuming well-tweaked timings, of course
20:08:16 one could hope for a less memory-bound I/O chiplet on Zen 6 :)
20:10:58 would this make it less sensitive to memory latency?
if prefetch runs far enough ahead, I guess so; n=3 seems to be right at that limit for Zen5
20:11:22 (to the point where it becomes bandwidth limited, not latency limited)
20:14:28 It's less sensitive to latency, but it can easily max out the bandwidth of random 64-byte accesses, because it doubles the amount of in-flight accesses
20:14:57 So it's still sensitive to latency, because latency defines this ceiling
20:15:10 Latency and the memory controller
20:15:49 That said, only the 24-core Zen 6 can max it out
20:16:03 All other AM5 CPUs are fine
20:22:56 would be nice to see new benchmarks not done with just my library :)
20:34:07 I will try to implement it by tomorrow evening, for the x64 JIT
20:47:28 if you are specifically doing just n=3, you already have an optimal setup for it (which can probably reuse the existing register instead of a temporary stack like I do), but I support n=2 to n=4 (or n=8 with changes)
21:16:41 Yes, I can shift 96 bits through the 64-bit register that holds mx/ma. 32 bits go out on one side, 32 bits go in on the other side
21:38:01 24-core Zen 6 ... is that dual 12-core CCDs?
22:42:06 Yes, Zen 6 will have a 12-core CCD
22:42:19 Zen 6c will have a 32-core CCD, but it's only for servers