06:33:29 hmm I might grab a beagleboard for testing, not much RAM, but enough to play with
06:33:50 it's the C910, not the C620, but worth a play
06:35:47 new beagleboards have a riscv variant? or are you saying this for aarch64?
06:38:23 I always wondered if the PRUs on the beagleboards' Sitara can be used for "accelerating" stuff, but they are too slow at a mere 200 MHz and there are only two of them. their best feature must be having "all instructions are single cycle"
06:40:37 plus each instruction is single cycle until you realize some opcodes are pseudo-opcodes and they assemble to multiple opcodes, hence not single cycle. but at least each real opcode is still single cycle as far as my work showed.
06:41:13 https://www.beagleboard.org/boards/beaglev-ahead
06:42:06 Processor: T-Head TH1520 (quad-core Xuantie C910 processor)
06:42:06 Memory: 4GB LPDDR4
06:42:06 Storage: 16GB eMMC
06:42:53 oh sweet, thanks
06:43:24 150 bucks
06:45:26 yeah not too cheap, but I can get it delivered pretty quick
06:45:30 might wait til payday
06:47:49 tbf beagleboards were never the cheap option. granted they had good software support, but they kinda have that "branding" price tag. good option for a quality RISC-V product ofc.
06:49:39 2 x 2GB LPDDR4x, 16 bits x 2 channels, 2133 MHz. yeah, not expecting to ROI mining with it :D
06:51:37 dang, I ordered 1000 for a mining farm already :'(
06:51:48 :D
06:52:13 it's probably pretty low power tho
06:52:35 you could probably ROI with 1000 of them in a farm... just not mining :D could sell that as a "Homebuilt X6" :D
06:55:32 hmm, has anyone even received their Antminer X5 yet? Or are all of the resellers holding onto them?
06:56:58 says 5V 2A for the beagleboard. less than an rpi4 for similar specs.
06:57:19 hmm, thought the pi4 was 5V 2A
06:57:48 hell, I have a couple pi3's running on 500mA (they're not doing very much though)
06:57:54 well, 550mA
06:58:02 its official power brick is 3A
06:58:48 guess that's for the hdmi output, audio and IO all in use
06:59:40 I'm just ballpark comparing the required numbers. otherwise I too am running the older Pis with less than the desired amps
07:02:36 meh, decisions decisions... buy the rtlsdr kit I wanted to play with, or a beagleboard
07:03:24 in fairness all I'd do with the beagleboard is get debian up and running on it, then open up a vlan'd ssh for sech1 and hyc to play with it
07:03:53 my knowledge is terrible, time would be better spent on it by someone who knows what they're doing :D
07:09:37 If only I had time for all this :D
07:10:03 Right now it's hard to find time even to write aarch64 code for RandomX tweaks
07:59:01 I really need to start trying to do Elon Musk work hours :| there are not enough hours in the day to do everything I need to, let alone want to :|
08:04:55 bad advice, but in my early 20s I did a month of sleeping only every other night. you get some sweet extra time, but you also get nausea in the mornings whether you sleep or not, and terrible health.
08:07:44 tis bad advice, but heck I've done it plenty and I'm in my 40s now - still occasionally do it.
08:08:10 something about the amount of work I get done through the night almost adds up to 3 days of working during the day
12:51:42 Hmm, the AES tweak is only 16 instructions on x86, but 48 instructions on aarch64 (but each aarch64 instruction should be faster because it does less stuff than x86's aesenc)
13:44:52 I still don't understand why we need to tweak PoW now?
It's still anti-ASIC; why make every miner update and cause hashrate fluctuations again? even botnets won't like it
13:45:42 We will be tweaking PoW anyway, for an unrelated reason
13:46:09 https://github.com/monero-project/monero/issues/8827
13:49:49 meh, been wondering what that number was for a while... not sure my VPS can hold out with a 40GB/year blockchain increase :P struggling for space as it is
13:50:02 would have to prune it :| I like having a full blockchain hosted
13:51:59 so regarding PoW tweaks: one tweak is for fast partial block verification, and the CFROUND and AES tweaks are to make modern CPUs more efficient (not to brick ASICs). RandomX is 4 years old now, it needs to be tweaked for modern CPUs.
13:55:14 heck, if you can get my 7950x up to 40 kh/s that'd be nice :)
13:55:34 I tested the 7950X, it's a 7-8% speedup like other Ryzens
13:55:47 so expect +1.5 kh/s
13:56:24 and this increase is while doing more stuff in the RandomX loop at the same time :D
13:56:57 and of course you'll only get +1.5 kh/s if your memory is fast enough (you're not bottlenecked by memory)
14:08:26 definitely need to spend some time tuning it better
14:08:38 also need to deal with the secure boot issue in debian :/
14:09:00 can't for the life of me get MSR working - probably just going to need to sign the module and load it in
18:05:21 tevador https://github.com/tevador/RandomX/pull/274#issuecomment-1735950468
18:05:37 and hyc ^
18:05:59 my branch is ready for testing on arm64 CPUs which have AES
18:12:38 Thanks, I'll check it out.
18:13:22 You'll have to add "--v2" to the command line to test RandomX v2
18:24:11 Maybe it's premature optimization, but I wanted to avoid runtime checks of the flags, since the VmBase virtual function table already serves this purpose.
18:24:55 flags are checked only 1-2 times per program, because CFROUND is rare and AES is only once per program
18:25:13 templating so much code (the whole class) bloats the code, which is also bad
18:26:09 the CFROUND check can be optimized away with several versions of the h_CFROUND function
18:26:41 because it's called by pointer anyway, so you can just point to the correct function when initializing JitCompiler::engine
18:26:43 ^this is what my work-in-progress solution does
18:27:02 I think XMRig already does it - different versions for different CPU models
18:27:53 I don't know if it makes sense for CFROUND specifically
18:28:02 It's literally called once per program (on average)
18:28:15 and once per program in the JIT compiler too
18:29:04 It would save 24 branches per RandomX hash. Maybe it is premature to optimize it.
18:31:35 It doesn't hurt. Code will get a bit bigger, but only one version of h_CFROUND will ever be called after initialization, so it won't hurt the CPU code cache
18:33:19 RANDOMX_FLAG_HARD_AES checks don't need to be optimized. They happen only 2 times per program, and they're 100% predictable by the branch predictor
18:40:22 heh, Apple M1 also has slow CFROUND: https://github.com/tevador/RandomX/pull/274#issuecomment-1736070384
20:01:52 tevador hyc I'm experimenting with different program sizes: https://paste.debian.net/hidden/36225154/
20:02:12 the 7950X can execute 30% more RandomX instructions per clock with program size = 512
20:02:32 which means it's not doing anything 30% of the time with the current parameters
20:02:41 and CPU idling = CPU wasting power and losing efficiency
20:04:02 I'm sure that the X5 has balanced performance between CPU cores and memory, so they never wait for data from memory
20:04:40 Now I'm in favor of increasing the program size to increase CPU efficiency
20:05:27 And the 7950X idles 30% of the time even with overclocked and tuned memory (DDR5-6000 CL30 with tuned sub-timings). It will be even worse with slower memory.
20:07:01 Current RandomX parameters were tuned for old CPUs from 2019. In 4 years, CPUs got 50-70% faster (per thread in RandomX), but memory latency didn't improve.
20:07:58 Something to think about
20:08:19 We'd have to be very careful not to exceed x86 uop cache sizes, or power efficiency will go down the drain.
20:09:15 https://en.wikichip.org/wiki/amd/microarchitectures/zen_4#Key_changes_from_Zen_3
20:09:24 Op cache size increased from 4,096 to 6,750 ops per core
20:09:36 Zen 5 is expected to have a 15-20% IPC increase yet again
20:09:56 and it will do it with the same cache sizes, so it will again be a wider and faster core
20:10:25 Light verification time is another concern. We're already increasing it for soft-AES systems by 10%.
20:10:43 Intel CPUs also increase the op cache and other caches in each generation
20:11:17 I'm not saying to increase the program size to 512 :D
20:11:23 288 or 320 seems more realistic
20:12:42 For psychological reasons, I think it should be possible to increase it a bit to compensate for the hashrate boost of AMD CPUs.
20:13:05 Miners don't like the "number to go down".
20:14:12 288 (v2) is still a bit faster than 256 (v1) on the 7950X
20:15:19 https://paste.debian.net/hidden/c3776d82/
20:15:32 v1 is 1635 h/s
20:16:36 What is the ins/s increase from 256 to 288?
20:16:55 +7.2%
20:18:22 You might have to subtract the time to run RandomX without any instructions and dataset reads.
20:18:38 v1 is 1635 h/s = 6.857e9 ins/s, v2 (288) is 1670.5 h/s = 7.882e9 ins/s, so overall almost a 15% increase
20:18:46 I think I tested this one time but forgot the numbers.
20:19:12 IIRC the actual RandomX instructions take more than 90% of the hash time
20:23:00 If we assume the same 90% for both versions, the relative difference will stay the same
20:23:40 Because they have similar hashrates, and the non-RandomX stuff is exactly the same in both versions
20:24:42 Which means 7.882/6.857 will turn into (7.882/0.9)/(6.857/0.9) and the end result is the same: a 15% increase.
If my math is correct
20:25:28 Hmm, I think it's not correct :D
20:26:19 No, it will be a bit less. It's like you are increasing from ~280 to ~312 instead of from 256 to 288.
20:27:18 from 256 to 288 was a 7.2% increase. I just added the increase from CFROUND to get the final 15%
20:27:39 the 7.2% increase is if both are RandomX v2
20:30:33 T(hash) = T(fill scratchpad) + T(JIT), right? If 256 (v1) and 288 (v2) have almost the same T(hash), and T(fill scratchpad) is the same because no changes there, then it means T(JIT) is also the same, so we can just do 288/256 and be done :D
20:31:06 288/256 = +12.5%, 312/280 = +11.4%
20:31:14 so we came to the same conclusion by two different methods
20:45:46 I just realized that keeping v2 hashrates the same (or a bit lower) is important, to not increase the dependency on fast tuned RAM. Higher hashrates require better RAM timings and better tuning.
20:47:10 While we're at it, we could also tweak the instruction frequencies...
20:47:35 I don't have any ideas on what to tweak there yet
20:47:40 Seems pretty balanced
20:48:43 It might be a good idea to have benchmarks of programs filled entirely with one type of instruction.
20:51:09 I would increase the FMUL_R frequency a bit, because it's the most energy-burning instruction where ASICs would have minimal advantage
20:51:19 and reduce the IXOR_R frequency proportionally
20:51:33 +4 for FMUL_R, -4 for IXOR_R
20:54:39 I'd have to check that we're not getting too many infinities. Btw, that also applies to just increasing the ins count per iteration.
20:55:56 yes
20:57:18 +4 for FMUL_R, -4 for IXOR_R even increased the hashrate a bit - from 1670 to 1676 h/s (v2, 288)
20:58:09 There are probably tighter dependency chains between integer registers compared to floating point.
20:58:54 But on the other hand, one of the audits warned that we don't have enough dependencies in the code.
20:59:30 That's good, we don't want too many dependencies.
Future CPUs will get wider and wider cores
21:00:32 With fewer dependency chains, we should probably increase CBRANCH a bit to kill VLIW designs.
21:02:53 A branch comes every 10 instructions on average now. If we make it every 9, that's still enough space for VLIW
21:04:53 to kill VLIW, we'd need branches every 3-4 instructions, and that's a bit extreme
21:05:58 it's better to increase the energy burned by instruction execution (FMUL_R is the best), because you can't avoid executing it in any CPU design. VLIW only saves energy on instruction scheduling, not instruction execution
21:34:20 Btw, I managed to trim the risc-v FDIV_M down to 56 bytes, which is quite impressive for scalar code. ARM needs 48 bytes with vector code. I'm estimating that the risc-v vector equivalent will be 32 bytes, same as x86.