-
DataHoarder
I have been experimenting with making the ma/mx prefetch registers go from 2 to 3/4 (effectively converting them into a ring buffer). What this means is that we can prefetch several VM iterations ahead instead of just the next one. For v2 (program size 384 and ring size 2->3), looking at pipeline performance counters on Zen5 makes it go from
-
DataHoarder
27% stalled bound by memory down to 19%, while increasing hashrate (on my test setup) from 10 kH/s to 12 kH/s. With ring size 4, that is 13 kH/s. So it effectively does way more work, uses more memory bandwidth, and stalls less
-
DataHoarder
Here's a table of some of the tests and hashrates across Zen5/Zen3 and a random i7-7700K I have around
-
DataHoarder
talking with sech1 suggests doing program size 384 (as currently in v2) and ringSize(n)=3
-
DataHoarder
the hashrates are a bit noisy, specifically for the i7 (no huge pages there), but the perf stats tend to be consistent across different runs
-
DataHoarder
effectively, if you have A, B, C with N=3: at the end of the loop you issue the dataset prefetch at C, then read from the location at A; the next iteration it'd be A, B; then B, C; then back to C, A (a ring)
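a rough Go sketch of that rotation (the slot values and the +100 address update are placeholders for illustration, not the actual go-randomx mixing step):

```go
package main

import "fmt"

// ringSize is N in the discussion above: how many dataset prefetches
// are in flight at once.
const ringSize = 3

// simulate runs `iters` loop iterations of the prefetch ring and
// returns the dataset indices that were read, in order.
func simulate(iters int) []uint64 {
	ring := [ringSize]uint64{10, 20, 30} // slots A, B, C, oldest first
	var reads []uint64
	for i := 0; i < iters; i++ {
		// read the item whose prefetch was issued ringSize-1 iterations ago
		readIdx := ring[0]
		reads = append(reads, readIdx)
		// placeholder for the next prefetch address
		// (really mx mixed with the VM registers)
		next := readIdx + 100
		// rotate the ring: drop the oldest slot, prefetch into the newest
		copy(ring[:], ring[1:])
		ring[ringSize-1] = next
	}
	return reads
}

func main() {
	fmt.Println(simulate(4)) // reads follow the A, B, C, A... ring order
}
```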
-
DataHoarder
oh, perf counters on zen5 as well:
-
DataHoarder
this is using perf stat --metrics PipelineL1,PipelineL2
-
DataHoarder
the % in parentheses is just how much time was spent measuring that metric, not the actual %
-
sech1
TL;DR: n=3 effectively makes the v2 hashrate higher than v1, while still doing more than 1.5x the work
-
sech1
The other side of the coin is that it allows calculating two superscalar hashes at the same time, but it should be fine because they will still burn the same amount of energy per hash
-
sech1
This will speed up the light mode though...
-
sech1
Light mode is about burning more energy, not hashrate per se, right?
-
eureka
less stalling is good, keep the memory busy
-
DataHoarder
n=4 was giving an extra +1 kH/s on Zen5
-
DataHoarder
I tested n=8 too, but that was just another 800 H/s over n=4
-
DataHoarder
(for context, 9900X3D with 2x CP64G56C46U5.M16B1 running in low power mode)
-
DataHoarder
previous benchmark with just large pages on v1
xmrig.com/benchmark/361m7w
-
DataHoarder
Zen3 gained little going from n=2 to n=3, and the i7 is mostly the same across all
-
sech1
Zen 3 is probably memory bound, but on the RAM side, not CPU side
-
sech1
They just can't give more hashrate
-
DataHoarder
yeah, it's also a cursed suboptimal setup with only 2x LRDIMM out of 8x populated
-
DataHoarder
that'd be something for any future tweaks, anyhow.
-
sech1
I mean your specific Zen 3 build
-
DataHoarder
I don't doubt so :)
-
sech1
24 core Zen 6 will def hit 40+ kh/s with this tweak, and will get RAM limited too...
-
sech1
Or more like it will be right on the edge of being RAM bound
-
DataHoarder
I have the testing changes that support setting what I called ProgramPrefetchRingSize on my go-randomx auto-test branch
git.gammaspectra.live/P2Pool/go-randomx/src/branch/auto-test
-
sech1
Assuming well tweaked timings of course
-
DataHoarder
one could hope for a less memory-bound I/O chiplet on Zen6 :)
-
DataHoarder
would this make it less sensitive to memory latency? if prefetch goes far enough ahead I guess so; n=3 seems to be at that limit for Zen5
-
DataHoarder
(to the point where it becomes bandwidth limited, not latency)
-
sech1
It's less sensitive to latency, but it can easily max out the bandwidth of random 64-byte accesses, because it doubles the amount of in-flight accesses
-
sech1
So it's still sensitive to latency because latency defines this ceiling
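A back-of-envelope Little's-law sketch of that ceiling (every number below is made up for illustration, not measured):

```go
package main

import "fmt"

func main() {
	const (
		lineBytes = 64   // one random dataset read is a 64-byte cacheline
		inFlight  = 6.0  // hypothetical concurrent misses sustained per core
		latencyNs = 80.0 // hypothetical average memory latency in ns
		cores     = 12.0
	)
	// Little's law: throughput = concurrency / latency.
	// bytes per nanosecond is numerically GB/s, so lower latency
	// (or more in-flight accesses) raises the achievable bandwidth.
	bytesPerNs := inFlight / latencyNs * lineBytes * cores
	fmt.Printf("~%.1f GB/s of random 64-byte reads\n", bytesPerNs)
}
```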
-
sech1
Latency and memory controller
-
sech1
That said, only 24 core Zen 6 can max it out
-
sech1
All other AM5 CPUs are fine
-
DataHoarder
would be nice to see new benchmarks not done with just my library :)
-
sech1
I will try to implement it by tomorrow evening, for x64 JIT
-
DataHoarder
if you are specifically doing just n=3 you already had an optimal setup for it (which can probably reuse the existing register instead of a temporary stack slot like I do), but I support n=2 to n=4 (or n=8 with changes)
-
sech1
Yes, I can shift 96 bits through the 64-bit register that holds mx/ma. 32 bits go out on one side, 32 bits go in on the other side
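In Go terms, that shift trick could look something like this (a sketch only; the real thing is x64 JIT code, and the addresses here are placeholders):

```go
package main

import "fmt"

// step models one loop iteration for n=3: a single 64-bit register
// holds the two younger in-flight 32-bit addresses; the oldest one
// shifts out the low side to be read this iteration, and the newly
// computed prefetch address shifts in on the high side.
func step(reg uint64, next uint32) (readAddr uint32, newReg uint64) {
	readAddr = uint32(reg)              // oldest 32 bits go out one side
	newReg = reg>>32 | uint64(next)<<32 // new 32 bits come in the other side
	return
}

func main() {
	reg := uint64(1) | uint64(2)<<32 // addresses 1 (oldest) and 2 in flight
	read, reg2 := step(reg, 3)       // prefetch address 3, read address 1
	fmt.Println(read, uint32(reg2), uint32(reg2>>32))
}
```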
-
eureka
24 core zen 6 ... is that dual 12 core CCD?
-
sech1
Yes, zen 6 will have 12 core CCD
-
sech1
zen 6c will have 32 core CCD, but it's only for servers