18:54:04 I have been experimenting with making the ma, mx prefetch registers go from 2 to 3/4 slots (effectively converting them into a ring buffer). What this means is that we can prefetch several VM iterations ahead instead of just the next one. For v2 (program size 384, ring size 2->3), looking at pipeline performance counters on Zen5,
18:54:06 it goes from 27% stalled due to being memory bound down to 19%, while increasing the hashrate (on my test setup) from 10 kH/s to 12 kH/s. With ring size 4, that's 13 kH/s. So it effectively does way more work, uses more memory bandwidth, and stalls less
18:54:37 Here's a table of some of the tests and hashrates across Zen5/Zen3 and a random i7-7700K I've got around
18:54:39 https://paste.debian.net/hidden/1677baa9
18:55:09 talking with sech1 suggests doing program size 384 (as is currently in v2) and ringSize(n)=3
18:55:46 the hashrates are a bit noisy, specifically for the i7 (no huge pages there), but the perf stats tend to be on point for each run when measured across different runs
18:57:35 effectively, if you have A, B, C with N=3: at the end of the loop you write the dataset prefetch at C, and then read from the location at A; the next iteration it'd be A, B; then B, C; then back to C, A (a ring)
19:22:36 oh, perf counters on zen5 as well:
19:22:38 size 384, n2 https://paste.debian.net/hidden/358363f8 10 kH/s
19:22:40 size 384, n3 https://paste.debian.net/hidden/0cd2347c 12 kH/s
19:23:11 this is using perf stat --metrics PipelineL1,PipelineL2
19:23:13 equivalent to the counters listed on https://docs.amd.com/r/en-US/57368-uProf-user-guide/Pipeline-Utilization
19:24:11 the % in parentheses is just what fraction of time was spent measuring that metric, not the actual %
19:43:28 TL;DR: n=3 effectively makes the v2 hashrate higher than v1's, while still doing more than 1.5x the work
19:45:56 The other side of the coin is that it allows calculating two superscalar hashes at the same time, but it should be fine because they will still burn the same amount of
energy per hash
19:48:27 This will speed up the light mode though...
19:49:06 Light mode is about burning more energy, not hashrate per se, right?
19:57:25 less stalling is good, keep the memory busy
19:57:55 n=4 was giving an extra +1 kH/s on Zen5
19:58:10 I tested n=8 too, but that was just another 800 H/s over n=4
19:59:27 (for context, a 9900X3D with 2x CP64G56C46U5.M16B1 running in low power mode)
20:00:28 previous benchmark with just large pages on v1: https://xmrig.com/benchmark/361m7w
20:03:12 Zen3 gained barely anything going from n=2 to n=3, and the i7 is mostly the same across all of them
20:04:37 Zen 3 is probably memory bound, but on the RAM side, not the CPU side
20:04:55 They just can't give more hashrate
20:05:10 yeah, it's also a cursed suboptimal setup with only 2x LRDIMM out of 8x populated
20:05:12 that'd be something for any future tweaks, anyhow.
20:05:13 I mean your specific Zen 3 build
20:05:22 I don't doubt it :)
20:06:30 a 24-core Zen 6 will definitely hit 40+ kH/s with this tweak, and will get RAM limited too...
20:07:01 Or more like it will be right on the edge of being RAM bound
20:07:09 I have the testing changes that support setting what I called ProgramPrefetchRingSize on my go-randomx auto-test branch https://git.gammaspectra.live/P2Pool/go-randomx/src/branch/auto-test
20:07:14 Assuming well-tweaked timings, of course
20:08:16 one could hope for a less memory-bound I/O chiplet on Zen 6 :)
20:10:58 would this make it less sensitive to memory latency?
if prefetch runs far enough ahead, I guess so; n=3 seems to be right at that limit for Zen5
20:11:22 (to the point where it becomes bandwidth limited, not latency limited)
20:14:28 It's less sensitive to latency, but it can easily max out the bandwidth of random 64-byte accesses, because it doubles the amount of in-flight accesses
20:14:57 So it's still sensitive to latency, because latency defines this ceiling
20:15:10 Latency and the memory controller
20:15:49 That said, only the 24-core Zen 6 can max it out
20:16:03 All other AM5 CPUs are fine
20:22:56 would be nice to see new benchmarks not done with just my library :)
20:34:07 I will try to implement it by tomorrow evening, for the x64 JIT
20:47:28 if you are specifically doing just n=3, you already have an optimal setup for it (which can probably reuse the existing register instead of a temporary stack like I do), but I support n=2 to n=4 (or n=8 with changes)
21:16:41 Yes, I can shift 96 bits through the 64-bit register that holds mx/ma. 32 bits go out on one side, 32 bits go in on the other side
21:38:01 24-core Zen 6 ... is that dual 12-core CCDs?
22:42:06 Yes, Zen 6 will have a 12-core CCD
22:42:19 Zen 6c will have a 32-core CCD, but it's only for servers