-
sech1
dataset init time reduced from 28.294 s to 21.728 s
-
sech1
30% speedup, but it also includes the cache init part which didn't change
-
sech1
cache init is 5-6 seconds, approximately
-
sech1
so dataset init time reduced from ~22 to ~15 seconds
-
sech1
In theory it would be down from 22 to 11 seconds if RAM access weren't a bottleneck
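Rough arithmetic behind those estimates, assuming ~6 s of cache init in both totals (cache init was only given as approximately 5-6 s, so these are ballpark figures):

$$
28.3 - 6 \approx 22\,\text{s} \;\to\; 21.7 - 6 \approx 15.7\,\text{s}, \qquad 22 / 2 = 11\,\text{s if the dataset part scaled a full 2x}
$$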
-
sech1
from 28.294 down to 17.018 seconds to init dataset
-
plowsof
👏
-
sech1
Next on my list is to write vectorized soft AES for the hash/fill AES code to speed up that part. After that I'll be comfortable enough with RISC-V assembly and vector instructions to add them to the actual RandomX JIT
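For reference, a scalar sketch of the table-based ("soft") AES round that this kind of vectorization targets. It's the standard T-table form, not the actual RandomX soft_aes.h; lutEnc0-lutEnc3 stand in for the usual 256-entry encryption lookup tables and are assumed to be defined elsewhere. Vectorizing it mostly means turning the table lookups into gather loads spread across several independent AES states at once.

```cpp
#include <stdint.h>

// Assumed: four 256-entry AES encryption T-tables (standard soft-AES layout).
extern const uint32_t lutEnc0[256], lutEnc1[256], lutEnc2[256], lutEnc3[256];

// One AES encryption round (SubBytes + ShiftRows + MixColumns + AddRoundKey)
// on a 128-bit state stored as four little-endian 32-bit columns s[0..3].
static void soft_aesenc(uint32_t s[4], const uint32_t key[4])
{
    const uint32_t s0 = s[0], s1 = s[1], s2 = s[2], s3 = s[3];
    s[0] = key[0] ^ lutEnc0[s0 & 0xff] ^ lutEnc1[(s1 >> 8) & 0xff] ^ lutEnc2[(s2 >> 16) & 0xff] ^ lutEnc3[s3 >> 24];
    s[1] = key[1] ^ lutEnc0[s1 & 0xff] ^ lutEnc1[(s2 >> 8) & 0xff] ^ lutEnc2[(s3 >> 16) & 0xff] ^ lutEnc3[s0 >> 24];
    s[2] = key[2] ^ lutEnc0[s2 & 0xff] ^ lutEnc1[(s3 >> 8) & 0xff] ^ lutEnc2[(s0 >> 16) & 0xff] ^ lutEnc3[s1 >> 24];
    s[3] = key[3] ^ lutEnc0[s3 & 0xff] ^ lutEnc1[(s0 >> 8) & 0xff] ^ lutEnc2[(s1 >> 16) & 0xff] ^ lutEnc3[s2 >> 24];
}
```

The hash/fill code works on several independent AES states, so there are always parallel lookups available to spread across a vector register.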
-
sech1
(Because this CPU doesn't have hardware AES, sad)
-
sech1
`RxDataset::init` timings: 21865 ms before, 10639 ms after
-
sech1
That's more than 2x speedup, lol
-
sech1
I expected max 2x from vector code
-
DataHoarder
they might have way more vector registers than scalar ones nowadays :D
-
sech1
I guess vector instructions are overall more efficient
-
sech1
Or it's the fact that I issue prefetch instructions in the vector code, and the scalar code doesn't
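Simplified sketch of what that prefetching looks like (not the real init loop; next_item_addr, process_item and PREFETCH_DISTANCE are placeholders):

```cpp
#include <stdint.h>

// Placeholders for the real address calculation and per-item work.
const void* next_item_addr(uint64_t item_number);
void process_item(uint8_t* out, uint64_t item_number);

// How many iterations ahead to prefetch (tuning parameter, picked arbitrarily here).
enum { PREFETCH_DISTANCE = 2 };

void init_items(uint8_t* out, uint64_t first_item, uint64_t count)
{
    for (uint64_t i = 0; i < count; ++i) {
        // Start pulling in data needed a few iterations from now, so the later
        // loads hit cache instead of stalling on RAM. __builtin_prefetch emits a
        // hint instruction where the target has one (e.g. Zicbop on RISC-V) and
        // compiles to nothing otherwise.
        if (i + PREFETCH_DISTANCE < count) {
            __builtin_prefetch(next_item_addr(first_item + i + PREFETCH_DISTANCE), 0, 0);
        }
        process_item(out + i * 64, first_item + i); // assuming 64-byte output items
    }
}
```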
-
sech1
It's the same number of registers (32)
-
DataHoarder
physical vs virtual I mean
-
sech1
ah, maybe it's also that register width is 256 bit (4x scalar)
-
sech1
but execution units are 128 bit
-
DataHoarder
oh, double pump!
-
sech1
so execution is only 2x faster, but it runs 4x fewer instructions, and some additional speedup comes from that
-
sech1
so it saves time on instruction decoding
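Roughly, assuming 64-bit scalar ops against 256-bit vector ops on 128-bit execution units:

$$
\frac{256}{128} = 2 \text{ passes per vector instruction}, \qquad \frac{256}{64} = 4 \text{ scalar instructions replaced}
$$

so per 256 bits of work it's one instruction through fetch/decode instead of four, and two execution passes instead of four.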
-
DataHoarder
I'm so sad Intel fucked up AVX512 so badly on new CPUs
-
DataHoarder
E-cores didn't have AVX512, so they dropped it entirely
-
DataHoarder
AMD: here's double-pumped AVX512, next gen, full pump
-
sech1
yes, it's 512-bit physical in Zen 5
-
DataHoarder
Intel is trying to define AVX10 to take bit width into account
-
DataHoarder
though they could have just double pumped it :')
-
sech1
I had an idea to write an AVX512 dataset init, but it makes little sense. It's already less than a second on a 9950X with 256-bit AVX
-
DataHoarder
all the useful stuff in AVX512 that isn't explicitly 512 bits wide is so great
-
DataHoarder
and what was that op that was slower than implementing it yourself, not just with vector instructions but even with scalar code :D
-
DataHoarder
it made people avoid it entirely because of the slowdown
-
DataHoarder
AMD: 1 cycle execution time
-
DataHoarder
no one uses it, so it's basically just a flex