06:33:29 <paulio_uk> hmm I might grab a beagleboard for testing, not much ram, but enough to play with
06:33:50 <paulio_uk> it's the C910 not the C920, but worth a play
06:35:47 <elucidator> new beagleboards have riscv variant? or are you saying this for aarch64?
06:38:23 <elucidator> I always wondered if PRUs on beagleboards' sitara can be used for "accelerating" stuff but they are too slow at a mere 200 MHz and there are only two of them. their best feature must be having "all instructions are single cycle"
06:40:37 <elucidator> plus each instruction is single cycle until you realize some opcodes are pseudo-opcodes that assemble to multiple real opcodes, hence not single cycle. but at least each real opcode is still single cycle as far as my work showed.
06:41:13 <paulio_uk> https://www.beagleboard.org/boards/beaglev-ahead
06:42:06 <paulio_uk> Processor: T-Head TH1520 (quad-core Xuantie C910 processor)
06:42:06 <paulio_uk>  Memory: 4GB LPDDR4
06:42:06 <paulio_uk>  Storage: 16GB eMMC
06:42:53 <elucidator> oh sweet, thanks
06:43:24 <elucidator> 150 bucks
06:45:26 <paulio_uk> yeah not too cheap, but I can get it delivered pretty quick
06:45:30 <paulio_uk> might wait til payday
06:47:49 <elucidator> tbf beagleboards were never the cheap option. granted they had good software support but they kinda have that "branding" price tag. good option for a quality RISC-V product ofc.
06:49:39 <paulio_uk> 2 x 2GB LPDDR4x, 16bits x 2 channels, 2133MHz     yeah not expecting to ROI mining with it :D
06:51:37 <elucidator> dang, I ordered 1000 for a mining farm already :'(
06:51:48 <paulio_uk> :D
06:52:13 <elucidator> it's probably pretty low power tho
06:52:35 <paulio_uk> you could probably ROI with 1000 of them in a farm... just not mining :D could sell that as a "Homebuilt X6" :D
06:55:32 <paulio_uk> hmm, has anyone even received their antminer X5 yet? Or are all of the resellers holding onto them?
06:56:58 <elucidator> says 5V 2A for beagleboard. less than an rpi4 for similar specs.
06:57:19 <paulio_uk> hmm thought the pi4 was 5v 2a
06:57:48 <paulio_uk> hell, I have a couple pi3's running on 500mA (they're not doing very much though)
06:57:54 <paulio_uk> well 550mA
06:58:02 <elucidator> its official power brick is 3A
06:58:48 <paulio_uk> guess that's for the hdmi output, audio and IO all in use
06:59:40 <elucidator> I'm just ballpark comparing the required numbers. otherwise I too am running the older PIs with less than desired amps
07:02:36 <paulio_uk> meh, decisions decisions... buy the rtlsdr kit I wanted to play with, or a beagleboard
07:03:24 <paulio_uk> in fairness all I'd do with the beagleboard is get debian up and running on it then open up a vlan'd ssh for sech1 and hyc to play with it
07:03:53 <paulio_uk> my knowledge is terrible, time would be better spent on it with someone that knows what they're doing :D
07:09:37 <sech1> If only I had time for all this :D
07:10:03 <sech1> Right now it's hard to find time even to write aarch64 code for RandomX tweaks
07:59:01 <paulio_uk> I really need to start trying to do Elon Musk work hours :| there's not enough hours in the day to do everything I need to, let alone want to :|
08:04:55 <elucidator> bad advice but in early 20s I did a month of sleeping only every other night. you get some sweet extra time but you also get nausea in the mornings whether you sleep or not and get terrible health. 
08:07:44 <paulio_uk> tis bad advice, but heck I've done it plenty and I'm in my 40s now - still occasionally do it.
08:08:10 <paulio_uk> something about the amount of work I get done through the night almost adds up to 3 days of working during the day
12:51:42 <sech1> Hmm, AES tweak is only 16 instructions on x86, but 48 instructions on aarch64 (but each aarch64 instruction should be faster because it does less stuff than x86's aesenc)
13:44:52 <m-relay> <polar9669:matrix.org> I still don’t understand why we need to tweak PoW now? It’s still anti-ASIC; why make every miner update and cause hashrate fluctuations again? even botnets won’t like it
13:45:42 <sech1> We will be tweaking PoW anyway, for unrelated reason
13:46:09 <sech1> https://github.com/monero-project/monero/issues/8827
13:49:49 <paulio_uk> meh been wondering what that number was for a while... not sure my VPS can hold out with 40gb/year blockchain increase :P struggling for space as it is
13:50:02 <paulio_uk> would hate to have to prune it :| I like having a full blockchain hosted
13:51:59 <sech1> so regarding PoW tweaks: one tweak is for fast partial block verification, the CFROUND and AES tweaks are to make modern CPUs more efficient (not to brick asics). RandomX is 4 years old now, it needs to be tweaked for modern CPUs.
13:55:14 <paulio_uk> heck, if you can get my 7950x upto 40kh/s that'd be nice :)
13:55:34 <sech1> I tested 7950X, it's 7-8% speedup like other Ryzens
13:55:47 <sech1> so expect +1.5 kh/s
13:56:24 <sech1> and this increase is while doing more stuff in the RandomX loop at the same time :D
13:56:57 <sech1> and of course you'll get +1.5 kh/s if your memory is fast enough (you're not bottlenecked by memory)
14:08:26 <paulio_uk> definitely need to spend some time tuning it better
14:08:38 <paulio_uk> also need to deal with the secure boot issue in debian :/
14:09:00 <paulio_uk> can't for the life of me get MSR working - probably just going to need to sign the mod and load it in
18:05:21 <sech1> tevador https://github.com/tevador/RandomX/pull/274#issuecomment-1735950468
18:05:37 <sech1> and hyc ^
18:05:59 <sech1> my branch is ready for testing on arm64 CPUs which have AES
18:12:38 <tevador> Thanks, I'll check it out.
18:13:22 <sech1> You'll have to add "--v2" to the command line to test RandomX v2
18:24:11 <tevador> Maybe it's premature optimization, but I wanted to avoid runtime checks of the flags since the VmBase virtual function table already serves this purpose.
18:24:55 <sech1> flags are checked only 1-2 times per program because CFROUND is rare, and AES is only once per program
18:25:13 <sech1> templating so much code (the whole class) bloats the code, which is also bad
18:26:09 <sech1> CFROUND check can be optimized away with several versions of h_CFROUND function
18:26:41 <sech1> because it's called by pointer anyway, so you can just point to the correct function when initializing JitCompiler::engine
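The dispatch scheme sech1 describes — pointing `JitCompiler::engine` at one of several `h_CFROUND` variants when the compiler is initialized — can be sketched in C. The state struct and handler bodies below are illustrative assumptions, not RandomX's actual API; only the idea (resolve the flag once, keep the hot path branch-free) is from the discussion:

```c
#include <stdint.h>

/* Hypothetical emitter state; RandomX's real JitCompiler keeps far more. */
typedef struct CompilerState {
    uint32_t emitted; /* bytes of machine code emitted, as a stand-in */
} CompilerState;

/* Two variants of the CFROUND emitter. The version flag is resolved
 * once at init time, so emitting an instruction involves no flag check. */
static void h_CFROUND_v1(CompilerState *s) { s->emitted += 8;  }
static void h_CFROUND_v2(CompilerState *s) { s->emitted += 16; /* tweak emits extra code */ }

typedef void (*InstrHandler)(CompilerState *);

/* Mirrors pointing the engine table entry at the right emitter at init. */
static InstrHandler select_cfround(int v2_enabled) {
    return v2_enabled ? h_CFROUND_v2 : h_CFROUND_v1;
}
```

Since the handler is already invoked through a pointer, picking the variant during initialization adds nothing to the per-instruction cost.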
18:26:43 <tevador> ^this is what my work-in-progress solution does
18:27:02 <sech1> I think XMRig already does it - different versions for different CPU models
18:27:53 <sech1> I don't know if it makes sense for CFROUND specifically
18:28:02 <sech1> It's literally called once per program (on average)
18:28:15 <sech1> and once per program in JIT compiler too
18:29:04 <tevador> It would save 24 branches per RandomX hash. Maybe it is premature to optimize it.
18:31:35 <sech1> It doesn't hurt. Code will get a bit bigger, but only one version of h_CFROUND will ever be called after initialization, so it won't hurt cpu code cache
18:33:19 <sech1> RANDOMX_FLAG_HARD_AES checks don't need to be optimized. They happen only 2 times per program, and they're 100% predictable by branch predictor
18:40:22 <sech1> heh, Apple M1 also has slow CFROUND: https://github.com/tevador/RandomX/pull/274#issuecomment-1736070384
20:01:52 <sech1> tevador hyc I'm experimenting with different program sizes: https://paste.debian.net/hidden/36225154/
20:02:12 <sech1> 7950X can execute 30% more RandomX instructions per clock with program size = 512
20:02:32 <sech1> Which means it's not doing anything 30% of the time with current parameters
20:02:41 <sech1> and CPU idling = CPU wasting power and losing efficiency
20:04:02 <sech1> I'm sure that X5 has balanced performance between CPU cores and memory, so they never wait for data from memory
20:04:40 <sech1> Now I'm in favor of increasing program size to increase CPU efficiency
20:05:27 <sech1> And 7950X idles 30% of the time with overclocked and tuned memory (DDR5-6000 CL30 with tuned sub-timings). It will be even worse with slower memory.
20:07:01 <sech1> Current RandomX parameters were tuned for old CPUs from 2019. In 4 years, CPUs got 50-70% faster (per thread in RandomX), but memory latency didn't improve.
20:07:58 <sech1> Something to think about
20:08:19 <tevador> We'd have to be very careful not to exceed x86 uop cache sizes or power efficiency will go down the drain.
20:09:15 <sech1> https://en.wikichip.org/wiki/amd/microarchitectures/zen_4#Key_changes_from_Zen_3
20:09:24 <sech1> Op cache size increased from 4,096 to 6,750 Ops per core
20:09:36 <sech1> Zen5 is expected to have 15-20% IPC increase yet again
20:09:56 <sech1> and it will do it with the same cache sizes, so it will be again wider and faster core
20:10:25 <tevador> Light verification time is another concern. We're already increasing it for soft-AES systems by 10%.
20:10:43 <sech1> Intel CPUs also increase opcache and other caches in each generation
20:11:17 <sech1> I'm not saying to increase program size to 512 :D
20:11:23 <sech1> 288 or 320 seems more realistic
20:12:42 <tevador> For psychological reasons, I think it should be possible to increase a bit to compensate for the hashrate boost of AMD CPUs.
20:13:05 <tevador> Miners don't like the "number to go down".
20:14:12 <sech1> 288 (v2) is still a bit faster than 256 (v1) on 7950X
20:15:19 <sech1> https://paste.debian.net/hidden/c3776d82/
20:15:32 <sech1> v1 is 1635 h/s
20:16:36 <tevador> What is the isn/s increase from 256 to 288?
20:16:55 <sech1> +7.2%
20:18:22 <tevador> You might have to subtract the time to run RandomX without any instructions and dataset reads.
20:18:38 <sech1> v1 is 1635 h/s = 6.857e9 ins/s, v2 (288) is 1670.5 h/s = 7.882e9 ins/s, so overall almost 15% increase
20:18:46 <tevador> I think I tested this one time but forgot the numbers.
20:19:12 <sech1> IIRC the actual RandomX instructions take more than 90% of the hash time
20:23:00 <sech1> If we assume the same 90% for both versions, the relative difference will stay the same
20:23:40 <sech1> Because they have similar hashrates, and non-RandomX stuff is exactly the same in both versions
20:24:42 <sech1> Which means 7.882/6.857 will turn into (7.882/0.9)/(6.857/0.9) and the end result is the same: 15% increase. If my math is correct
20:25:28 <sech1> Hmm, I think it's not correct :D
20:26:19 <tevador> No, it will be a bit less. It's like you are increasing from ~280 to ~312 instead of from 256 to 288.
20:27:18 <sech1> from 256 to 288 was 7.2% increase. I just added increase from CFROUND to get the final 15%
20:27:39 <sech1> the 7.2% increase is if both are RandomX v2
20:30:33 <sech1> T(hash) = T(fill scratchpad) + T(JIT), right? If 256 (v1) and 288 (v2) have almost the same T(hash), and T(fill scratchpad) is the same because no changes there, then it means T(JIT) is also the same, so we can just do 288/256 and be done :D
20:31:06 <sech1> 288/256 = +12.5%, 312/280 = +11.4%
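The ins/s conversion in these messages follows from RandomX's standard iteration parameters — 8 programs per hash, 2048 iterations per program — which aren't being changed here. A quick check of the arithmetic, assuming those parameters (the hashrates are the measurements quoted above):

```c
#include <assert.h>
#include <math.h>

/* Instructions executed per second at a given hashrate: each hash runs
 * 8 programs of 2048 iterations, program_size instructions each. */
static double ins_per_sec(double hashes_per_sec, int program_size) {
    return hashes_per_sec * 8.0 * 2048.0 * program_size;
}
```

With the figures above, `ins_per_sec(1635, 256)` ≈ 6.86e9 and `ins_per_sec(1670.5, 288)` ≈ 7.88e9, a ~15% increase overall, while the pure program-size ratios are 288/256 = 1.125 and 312/280 ≈ 1.114.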
20:31:14 <sech1> so we came to the same conclusion by two different methods
20:45:46 <sech1> I just realized that keeping v2 hashrates the same (or a bit lower) is important to not increase dependency on fast tuned RAM. Higher hashrates require better RAM timings and better tuning.
20:47:10 <tevador> While we're at it, we could also tweak the instruction frequencies...
20:47:35 <sech1> I don't have any ideas on what to tweak there yet
20:47:40 <sech1> Seems pretty balanced
20:48:43 <tevador> It might be a good idea to have benchmarks of programs filled entirely with one type of instruction.
20:51:09 <sech1> I would increase FMUL_R frequency a bit, because it's the most energy-burning instruction where ASICs would have minimal advantage
20:51:19 <sech1> and reduce IXOR_R frequency proportionally
20:51:33 <sech1> +4 for FMUL_R, -4 for IXOR_R
20:54:39 <tevador> I'd have to check if we're not getting too many infinities. Btw that also applies to just increasing the isn count per iteration.
20:55:56 <sech1> yes
20:57:18 <sech1> +4 for FMUL_R, -4 for IXOR_R even increased hashrate a bit - from 1670 to 1676 h/s (v2 288)
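The +4/-4 swap keeps the total frequency budget at 256. A minimal sketch, assuming the reference defaults from RandomX's configuration (FMUL_R = 32 and IXOR_R = 15 out of 256; treat these values as an assumption about the config at time of writing):

```c
/* Default frequencies (out of a total budget of 256) from RandomX's
 * reference configuration, for the two instructions being tweaked. */
enum { FREQ_FMUL_R = 32, FREQ_IXOR_R = 15 };

/* Proposed tweak: shift weight toward the energy-hungry FP multiply,
 * away from the cheap integer XOR, keeping the total sum unchanged. */
enum { FREQ_FMUL_R_V2 = FREQ_FMUL_R + 4,   /* 36 */
       FREQ_IXOR_R_V2 = FREQ_IXOR_R - 4 }; /* 11 */
```

Because only weight moves between the two entries, every other instruction's frequency — and the 256 total — stays exactly as before.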
20:58:09 <tevador> There are probably tighter dependency chains between integer registers compared to floating point.
20:58:54 <tevador> But on the other hand, one of the audits warned that we don't have enough dependencies in the code.
20:59:30 <sech1> That's good, we don't want too many dependencies. Future CPUs will get wider and wider cores
21:00:32 <tevador> With fewer dependency chains, we should probably increase CBRANCH a bit to kill VLIW designs.
21:02:53 <sech1> Branch is every 10 instructions on average now. If we make it 9, it's still enough space for VLIW
21:04:53 <sech1> to kill VLIW, we need branches every 3-4 instructions, and that's a bit extreme
21:05:58 <sech1> it's better to increase energy burned from instruction execution (FMUL_R is the best), because you can't avoid executing it in any CPU design. VLIW only saves energy on instruction scheduling, not instruction execution
21:34:20 <tevador> Btw, I managed to trim the risc-v FDIV_M isn down to 56 bytes, which is quite impressive for scalar code. ARM needs 48 bytes with vector code. I'm estimating that the risc-v vector equivalent will be 32 bytes, same as x86.