-
DataHoarder
I have been experimenting with making the ma/mx prefetch registers go from 2 to 3/4 (effectively converting them into a ring buffer). What this means is that we can prefetch several VM iterations ahead instead of just the next one. For v2 (program size 384 and ring size 2->3), looking at pipeline performance counters on Zen5 makes it go from
-
DataHoarder
27% stalled bound by memory down to 19%, while increasing hashrate (on my test setup) from 10 kH/s to 12 kH/s. With ring size 4, that is 13 kH/s. So it effectively does way more work, uses more memory bandwidth, and stalls less
-
DataHoarder
Here's a table of some of the tests and hashrates across Zen5/Zen3 and a random i7-7700K I have around
-
DataHoarder
talking with sech1 suggests doing program size 384 (as currently in v2) and ringSize(n)=3
-
DataHoarder
the hashrates are a bit noisy, specifically for the i7 (no huge pages there), but the perf stats tend to be consistent across different runs
-
DataHoarder
effectively, if you have A, B, C with N=3: at the end of the loop you issue the dataset prefetch at C, then read from the location at A; the next iteration it'd be A, B; then B, C; then back to C, A (a ring)
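a rough Go sketch of that rotation (the slot values and the +100 address update are placeholders for illustration, not the actual go-randomx mixing step):

```go
package main

import "fmt"

// ringSize is N in the discussion above: how many dataset prefetches
// are in flight at once.
const ringSize = 3

// simulate runs `iters` loop iterations of the prefetch ring and
// returns the dataset indices that were read, in order.
func simulate(iters int) []uint64 {
	ring := [ringSize]uint64{10, 20, 30} // slots A, B, C, oldest first
	var reads []uint64
	for i := 0; i < iters; i++ {
		// read the item whose prefetch was issued ringSize-1 iterations ago
		readIdx := ring[0]
		reads = append(reads, readIdx)
		// placeholder for the next prefetch address
		// (really mx mixed with the VM registers)
		next := readIdx + 100
		// rotate the ring: drop the oldest slot, prefetch into the newest
		copy(ring[:], ring[1:])
		ring[ringSize-1] = next
	}
	return reads
}

func main() {
	fmt.Println(simulate(4)) // reads follow the A, B, C, A... ring order
}
```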
-
DataHoarder
oh, perf counters on zen5 as well:
-
DataHoarder
this is using perf stat --metrics PipelineL1,PipelineL2
-
DataHoarder
the % in parentheses is just how much time was spent measuring that metric, not the actual %
-
sech1
TL;DR: n=3 effectively makes the v2 hashrate higher than v1, while still doing more than 1.5x the work
-
sech1
The other side of the coin is that it allows calculating two superscalar hashes at the same time, but it should be fine because they will still burn the same amount of energy per hash
-
sech1
This will speed up the light mode though...
-
sech1
Light mode is about burning more energy, not hashrate per se, right?
-
eureka
less stalling is good, keep the memory busy
-
DataHoarder
n=4 was giving an extra +1 kH/s on Zen5
-
DataHoarder
I tested n=8 too, but that was just another 800 H/s over n=4
-
DataHoarder
(for context, 9900X3D with 2x CP64G56C46U5.M16B1 running in low power mode)
-
DataHoarder
previous benchmark with just large pages on v1
xmrig.com/benchmark/361m7w
-
DataHoarder
Zen3 gained little going from n=2 to n=3, and the i7 is mostly the same across all
-
sech1
Zen 3 is probably memory bound, but on the RAM side, not CPU side
-
sech1
They just can't give more hashrate
-
DataHoarder
yeah, it's also a cursed suboptimal setup with only 2x LRDIMM out of 8x populated
-
DataHoarder
that'd be something for any future tweaks, anyhow.
-
sech1
I mean your specific Zen 3 build
-
DataHoarder
I don't doubt so :)
-
sech1
24 core Zen 6 will def hit 40+ kh/s with this tweak, and will get RAM limited too...
-
sech1
Or more like it will be right on the edge of being RAM bound
-
DataHoarder
I have the testing changes that support setting what I called ProgramPrefetchRingSize on my go-randomx auto-test branch
git.gammaspectra.live/P2Pool/go-randomx/src/branch/auto-test
-
sech1
Assuming well tweaked timings of course
-
DataHoarder
one could hope for a less memory-bound I/O chiplet on Zen6 :)
-
DataHoarder
would this make it less sensitive to memory latency? if prefetch goes far enough ahead I guess so; n=3 seems to be at that limit for Zen5
-
DataHoarder
(to the point where it becomes bandwidth limited, not latency)
-
sech1
It's less sensitive to latency, but it can easily max out the bandwidth of random 64-byte accesses, because it doubles the amount of in-flight accesses
-
sech1
So it's still sensitive to latency because latency defines this ceiling
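A back-of-envelope Little's-law sketch of that ceiling (every number below is made up for illustration, not measured):

```go
package main

import "fmt"

func main() {
	const (
		lineBytes = 64   // one random dataset read is a 64-byte cacheline
		inFlight  = 6.0  // hypothetical concurrent misses sustained per core
		latencyNs = 80.0 // hypothetical average memory latency in ns
		cores     = 12.0
	)
	// Little's law: throughput = concurrency / latency.
	// bytes per nanosecond is numerically GB/s, so lower latency
	// (or more in-flight accesses) raises the achievable bandwidth.
	bytesPerNs := inFlight / latencyNs * lineBytes * cores
	fmt.Printf("~%.1f GB/s of random 64-byte reads\n", bytesPerNs)
}
```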
-
sech1
Latency and memory controller
-
sech1
That said, only 24 core Zen 6 can max it out
-
sech1
All other AM5 CPUs are fine
-
DataHoarder
would be nice to see new benchmarks not done with just my library :)
-
sech1
I will try to implement it by tomorrow evening, for x64 JIT
-
DataHoarder
if you are specifically doing just n=3 you already had an optimal setup for it (which can probably reuse the existing register instead of a temporary stack slot like I do), but I support n=2 to n=4 (or n=8 with changes)
-
sech1
Yes, I can shift 96 bits through the 64-bit register that holds mx/ma. 32 bits go out on one side, 32 bits go in on the other side
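In Go terms, that shift trick could look something like this (a sketch only; the real thing is x64 JIT code, and the addresses here are placeholders):

```go
package main

import "fmt"

// step models one loop iteration for n=3: a single 64-bit register
// holds the two younger in-flight 32-bit addresses; the oldest one
// shifts out the low side to be read this iteration, and the newly
// computed prefetch address shifts in on the high side.
func step(reg uint64, next uint32) (readAddr uint32, newReg uint64) {
	readAddr = uint32(reg)              // oldest 32 bits go out one side
	newReg = reg>>32 | uint64(next)<<32 // new 32 bits come in the other side
	return
}

func main() {
	reg := uint64(1) | uint64(2)<<32 // addresses 1 (oldest) and 2 in flight
	read, reg2 := step(reg, 3)       // prefetch address 3, read address 1
	fmt.Println(read, uint32(reg2), uint32(reg2>>32))
}
```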
-
eureka
24 core zen 6 ... is that dual 12 core CCD?
-
sech1
Yes, zen 6 will have 12 core CCD
-
sech1
zen 6c will have 32 core CCD, but it's only for servers