15:01:55 Implemented vectorized dataset init for RISC-V: before https://p2pool.io/u/ceef12e3b3b4e8ea/Screenshot%20from%202025-11-30%2015-57-46.png after https://p2pool.io/u/3ef318b4f6ea6660/Screenshot%20from%202025-11-30%2016-00-21.png
15:02:18 dataset init time reduced from 28.294 s to 21.728 s
15:02:57 30% speedup, but it also includes the cache init part, which didn't change
15:04:05 cache init is 5-6 seconds, approximately
15:04:25 so dataset init time reduced from ~22 to ~15 seconds
15:05:07 In theory it would be down from 22 to 11 seconds if RAM access wasn't a bottleneck
17:59:34 Got even faster after fixing all the bugs: https://p2pool.io/u/8c593a881a361d8b/Screenshot%20from%202025-11-30%2018-58-54.png
17:59:50 from 28.294 down to 17.018 seconds to init the dataset
18:22:43 https://github.com/xmrig/xmrig/pull/3736
18:25:15 👏
18:25:57 Next on my plan is to write vectorized soft AES for the hash/fill AES to speed up that part. After that I'll be comfortable enough with RISC-V assembly and vector instructions to add them to the actual RandomX JIT
18:26:20 (Because this CPU doesn't have hardware AES, sad)
19:18:04 `RxDataset::init` timings: 21865 ms before, 10639 ms after
19:18:13 That's more than 2x speedup, lol
19:18:22 I expected max 2x from vector code
19:21:15 they might have way more vector registers than scalar ones nowadays :D
19:22:06 I guess vector instructions are overall more efficient
19:22:19 Or it's the fact that I do prefetch instructions in the vector code, and the scalar code doesn't
19:22:33 It's the same number of registers (32)
19:23:54 physical vs virtual, I mean
19:24:29 ah, maybe it's also that the register width is 256 bit (4x scalar)
19:24:35 but the execution units are 128 bit
19:24:46 oh, double pump!
19:24:52 so it is 2x faster, but executes 4x fewer instructions, and some additional speedup comes from this
19:25:11 so it saves time on instruction decoding
19:28:29 I'm so sad Intel fucked up AVX512 so badly on new CPUs
19:28:46 E cores didn't have 512, so they dropped it entirely
19:29:03 AMD: here's double-pumped AVX512; next gen, full pump
19:29:18 yes, it's 512 physical in Zen 5
19:29:20 Intel is trying to define AVX10 to take bit width into account
19:29:32 though they could have just double pumped it :')
19:29:39 I had an idea to write an AVX512 dataset init, but it makes little sense. It's already less than a second on a 9950X with 256-bit AVX
19:29:48 all the useful stuff in AVX512 that is not explicitly 512 bits wide is so great
19:30:34 and what was that op that was slower than implementing it yourself, not just yourself with vector instructions, but with scalar :D
19:30:45 made people not use it at all due to the slowdown
19:30:54 AMD: 1 cycle execution time
19:31:15 no one uses it, so it's just about a flex