-
paulio_uk
hmm I might grab a beagleboard for testing, not much ram, but enough to play with
-
paulio_uk
it's the C910, not the C920, but worth a play
-
elucidator
new beagleboards have riscv variant? or are you saying this for aarch64?
-
elucidator
I always wondered if the PRUs on the beagleboards' Sitara can be used for "accelerating" stuff, but they are too slow at a mere 200 MHz and there are only two of them. their best feature must be having "all instructions are single cycle"
-
elucidator
plus "each instruction is single cycle" only holds until you realize some opcodes are pseudo-opcodes that assemble to multiple real opcodes, so it's not truly single-cycle operation. but at least each real opcode is still single cycle as far as my work showed.
-
paulio_uk
-
paulio_uk
Processor T-Head TH1520 (quad-core Xuantie C910 processor)
-
paulio_uk
Memory 4GB LPDDR4
-
paulio_uk
Storage 16GB eMMC
-
elucidator
oh sweet, thanks
-
elucidator
150 bucks
-
paulio_uk
yeah not too cheap, but I can get it delivered pretty quick
-
paulio_uk
might wait til payday
-
elucidator
tbf beagleboards were never the cheap option. granted they had good software support but kinda have that "branding" price tag. good option for a quality RISC-V product ofc.
-
paulio_uk
2 x 2GB LPDDR4X, 16-bit x 2 channels, 2133 MHz... yeah, not expecting to ROI mining with it :D
-
elucidator
dang, I ordered 1000 for a mining farm already :'(
-
paulio_uk
:D
-
elucidator
it's probably pretty low power tho
-
paulio_uk
you could probably ROI with 1000 of them in a farm... just not mining :D could sell that as a "Homebuilt X6" :D
-
paulio_uk
hmm, has anyone even received their antminer X5 yet? Or are all of the resellers holding onto them?
-
elucidator
says 5V 2A for the beagleboard. less than an rpi4 for similar specs.
-
paulio_uk
hmm thought the pi4 was 5v 2a
-
paulio_uk
hell, I have a couple pi3's running on 500mA (they're not doing very much though)
-
paulio_uk
well 550mA
-
elucidator
its official power brick is 3A
-
paulio_uk
guess that's for when the hdmi output, audio and IO are all in use
-
elucidator
I'm just ballpark comparing the required numbers. otherwise I too am running the older Pis with less than desired amps
-
paulio_uk
meh, decisions decisions... buy the rtlsdr kit I wanted to play with, or a beagleboard
-
paulio_uk
in fairness all I'd do with the beagleboard is get debian up and running on it then open up a vlan'd ssh for sech1 and hyc to play with it
-
paulio_uk
my knowledge is terrible, time would be better spent on it with someone that knows what they're doing :D
-
sech1
If only I had time for all this :D
-
sech1
Right now it's hard to find time even to write aarch64 code for RandomX tweaks
-
paulio_uk
I really need to start trying to do Elon Musk work hours :| there's not enough hours in the day to do everything I need to, let alone want to :|
-
elucidator
bad advice, but in my early 20s I did a month of sleeping only every other night. you get some sweet extra time, but you also get nausea in the mornings whether you sleep or not, and your health gets terrible.
-
paulio_uk
tis bad advice, but heck I've done it plenty and I'm in my 40s now - still occasionally do it.
-
paulio_uk
something about the amount of work I get done through the night almost adds up to 3 days of working during the day
-
sech1
Hmm, AES tweak is only 16 instructions on x86, but 48 instructions on aarch64 (but each aarch64 instruction should be faster because it does less stuff than x86's aesenc)
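-
A minimal sketch of why the counts differ roughly 3:1 (illustrative only, not the code from sech1's branch): x86's aesenc does SubBytes, ShiftRows, MixColumns and the round-key XOR in a single instruction, while aarch64 splits the round into AESE + AESMC and still needs an extra EOR to match aesenc's key handling.

    #if defined(__aarch64__)
    #include <arm_neon.h>   // requires the AES/crypto extension (e.g. -march=armv8-a+crypto)
    // One x86-style AES round emulated on aarch64: 3 instructions.
    static inline uint8x16_t aes_round(uint8x16_t state, uint8x16_t round_key) {
        state = vaeseq_u8(state, vdupq_n_u8(0)); // SubBytes + ShiftRows (key XOR folded in, zero here)
        state = vaesmcq_u8(state);               // MixColumns
        return veorq_u8(state, round_key);       // add the round key afterwards
    }
    #else
    #include <wmmintrin.h>
    // The same round on x86 is a single aesenc.
    static inline __m128i aes_round(__m128i state, __m128i round_key) {
        return _mm_aesenc_si128(state, round_key);
    }
    #endif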
-
m-relay
<polar9669:matrix.org> I still don't understand why we need to tweak pow now? It's still anti-ASIC; why make every miner update and cause hashrate fluctuations again, even botnets won't like it
-
sech1
We will be tweaking PoW anyway, for unrelated reason
-
sech1
-
paulio_uk
meh been wondering what that number was for a while... not sure my VPS can hold out with 40gb/year blockchain increase :P struggling for space as it is
-
paulio_uk
would have to prune it :| I like having a full blockchain hosted
-
sech1
so regarding PoW tweaks: one tweak is for fast partial block verification, the CFROUND and AES tweaks are to make modern CPUs more efficient (not to brick asics). RandomX is 4 years old now, it needs to be tweaked for modern CPUs.
-
paulio_uk
heck, if you can get my 7950x up to 40 kh/s that'd be nice :)
-
sech1
I tested 7950X, it's 7-8% speedup like other Ryzens
-
sech1
so expect +1.5 kh/s
-
sech1
and this increase is while doing more stuff in the RandomX loop at the same time :D
-
sech1
and of course you'll get +1.5 kh/s if your memory is fast enough (you're not bottlenecked by memory)
-
paulio_uk
definitely need to spend some time tuning it better
-
paulio_uk
also need to deal with the secure boot issue in debian :/
-
paulio_uk
can't for the life of me get MSR working - probably just going to need to sign the mod and load it in
-
sech1
-
sech1
and hyc ^
-
sech1
my branch is ready for testing on arm64 CPUs which have AES
-
tevador
Thanks, I'll check it out.
-
sech1
You'll have to add "--v2" to the command line to test RandomX v2
-
tevador
Maybe it's premature optimization, but I wanted to avoid runtime checks of the flags since the VmBase virtual function table already serves this purpose.
-
sech1
flags are checked only 1-2 times per program because CFROUND is rare, and AES is only once per program
-
sech1
templating so much code (the whole class) bloats the code, which is also bad
-
sech1
CFROUND check can be optimized away with several versions of h_CFROUND function
-
sech1
because it's called by pointer anyway, so you can just point to the correct function when initializing JitCompiler::engine
-
tevador
^this is what my work-in-progress solution does
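-
A self-contained sketch of that dispatch idea; only the JitCompiler::engine and h_CFROUND names come from the chat, everything else is simplified and hypothetical rather than the actual work-in-progress code:

    struct Instruction { /* decoded RandomX instruction */ };

    class JitCompiler {
    public:
        using InstructionGen = void (JitCompiler::*)(const Instruction&);
        InstructionGen engine[256];                // opcode byte -> code generator, called via pointer

        void h_CFROUND_v1(const Instruction&) { /* emit legacy CFROUND */ }
        void h_CFROUND_v2(const Instruction&) { /* emit tweaked CFROUND */ }

        // The flag is consulted once here; compilation itself never branches on it.
        void initEngine(bool useV2, int cfroundSlot) {
            engine[cfroundSlot] = useV2 ? &JitCompiler::h_CFROUND_v2
                                        : &JitCompiler::h_CFROUND_v1;
        }
    };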
-
sech1
I think XMRig already does it - different versions for different CPU models
-
sech1
I don't know if it makes sense for CFROUND specifically
-
sech1
It's literally called once per program (on average)
-
sech1
and once per program in JIT compiler too
-
tevador
It would save 24 branches per RandomX hash. Maybe it is premature to optimize it.
-
sech1
It doesn't hurt. Code will get a bit bigger, but only one version of h_CFROUND will ever be called after the initialization, so it won't hurt the CPU code cache
-
sech1
RANDOMX_FLAG_HARD_AES checks don't need to be optimized. They happen only 2 times per program, and they're 100% predictable by branch predictor
-
sech1
-
sech1
tevador hyc I'm experimenting with different program sizes:
paste.debian.net/hidden/36225154
-
sech1
7950X can execute 30% more RandomX instructions per clock with program size = 512
-
sech1
Which means it's not doing anything 30% of the time with current parameters
-
sech1
and CPU idling = CPU wasting power and losing efficiency
-
sech1
I'm sure that X5 has balanced performance between CPU cores and memory, so they never wait for data from memory
-
sech1
Now I'm in favor of increasing program size to increase CPU efficiency
-
sech1
And 7950X idles 30% of the time with overclocked and tuned memory (DDR5-6000 CL30 with tuned sub-timings). It will be even worse with slower memory.
-
sech1
Current RandomX parameters were tuned for old CPUs from 2019. In 4 years, CPUs got 50-70% faster (per thread in RandomX), but memory latency didn't improve.
-
sech1
Something to think about
-
tevador
We'd have to be very careful not to exceed x86 uop cache sizes or power efficiency will go down the drain.
-
sech1
-
sech1
Op cache size increased from 4,096 to 6,750 Ops per core
-
sech1
Zen5 is expected to have 15-20% IPC increase yet again
-
sech1
and it will do it with the same cache sizes, so it will again be a wider and faster core
-
tevador
Light verification time is another concern. We're already increasing it for soft-AES systems by 10%.
-
sech1
Intel CPUs also increase opcache and other caches in each generation
-
sech1
I'm not saying to increase program size to 512 :D
-
sech1
288 or 320 seems more realistic
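-
For reference, the knob being discussed is RANDOMX_PROGRAM_SIZE in RandomX's configuration.h; the 288 below is just the experiment from this conversation, not a decided parameter, and the other values are the upstream defaults as I recall them:

    // configuration.h
    #define RANDOMX_PROGRAM_SIZE         288   // upstream default: 256
    #define RANDOMX_PROGRAM_ITERATIONS   2048  // unchanged
    #define RANDOMX_PROGRAM_COUNT        8     // unchanged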
-
tevador
For psychological reasons, I think it should be possible to increase a bit to compensate for the hashrate boost of AMD CPUs.
-
tevador
Miners don't like the "number to go down".
-
sech1
288 (v2) is still a bit faster than 256 (v1) on 7950X
-
sech1
-
sech1
v1 is 1635 h/s
-
tevador
What is the isn/s increase from 256 to 288?
-
sech1
+7.2%
-
tevador
You might have to subtract the time to run RandomX without any instructions and dataset reads.
-
sech1
v1 is 1635 h/s = 6.857e9 ins/s, v2 (288) is 1670.5 h/s = 7.882e9 ins/s, so overall almost 15% increase
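-
Those figures check out against the default 8 programs x 2048 iterations per hash:
    v1 (256):  8 * 2048 * 256 = 4,194,304 ins/hash; 4,194,304 * 1635 h/s   ≈ 6.86e9 ins/s
    v2 (288):  8 * 2048 * 288 = 4,718,592 ins/hash; 4,718,592 * 1670.5 h/s ≈ 7.88e9 ins/s
    7.88e9 / 6.86e9 ≈ 1.15, i.e. the ~15% quoted above.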
-
tevador
i think I tested this one time but forgot the numbers.
-
sech1
IIRC the actual RandomX instructions take more than 90% of the hash time
-
sech1
If we assume the same 90% for both versions, the relative difference will stay the same
-
sech1
Because they have similar hashrates, and non-RandomX stuff is exactly the same in both versions
-
sech1
Which means 7.882/6.857 will turn into (7.882/0.9)/(6.857/0.9) and the end result is the same: 15% increase. If my math is correct
-
sech1
Hmm, I think it's not correct :D
-
tevador
No, it will be a bit less. It's like you are increasing from ~280 to ~312 instead of from 256 to 288.
-
sech1
from 256 to 288 was a 7.2% increase. I just added the increase from CFROUND to get the final 15%
-
sech1
the 7.2% increase is if both are RandomX v2
-
sech1
T(hash) = T(fill scratchpad) + T(JIT), right? If 256 (v1) and 288 (v2) have almost the same T(hash), and T(fill scratchpad) is the same because no changes there, then it means T(JIT) is also the same, so we can just do 288/256 and be done :D
-
sech1
288/256 = +12.5%, 312/280 = +11.4%
-
sech1
so we came to the same conclusion by two different methods
-
sech1
I just realized that keeping v2 hashrates the same (or a bit lower) is important to not increase dependency on fast tuned RAM. Higher hashrates require better RAM timings and better tuning.
-
tevador
While we're at it, we could also tweak the instruction frequencies...
-
sech1
I don't have any ideas on what to tweak there yet
-
sech1
Seems pretty balanced
-
tevador
It might be a good idea to have benchmarks of programs filled entirely with one type of instruction.
-
sech1
I would increase FMUL_R frequency a bit, because it's the most energy-burning instruction where ASICs would have minimal advantage
-
sech1
and reduce IXOR_R frequency proportionally
-
sech1
+4 for FMUL_R, -4 for IXOR_R
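-
In configuration.h terms that experiment would look roughly like the following; the 32 and 15 defaults are the upstream frequencies as I recall them, shown only to illustrate the +/-4 shift:

    // configuration.h - instruction frequencies out of 256 program slots
    #define RANDOMX_FREQ_FMUL_R   36   // upstream default: 32
    #define RANDOMX_FREQ_IXOR_R   11   // upstream default: 15 (keeps the total at 256)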
-
tevador
I'd have to check if we're not getting too many infinities. Btw that also applies to just increasing the isn count per iteration.
-
sech1
yes
-
sech1
+4 for FMUL_R, -4 for IXOR_R even increased hashrate a bit - from 1670 to 1676 h/s (v2 288)
-
tevador
There are probably tighter dependency chains between integer registers compared to floating point.
-
tevador
But on the other hand, one of the audits warned that we don't have enough dependencies in the code.
-
sech1
That's good, we don't want too many dependencies. Future CPUs will get wider and wider cores
-
tevador
With fewer dependency chains, we should probably increase CBRANCH a bit to kill VLIW designs.
-
sech1
Branch is every 10 instructions on average now. If we make it 9, it's still enough space for VLIW
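-
(Assuming the upstream default of RANDOMX_FREQ_CBRANCH = 25 out of 256 slots: 256/25 ≈ one branch per 10.2 instructions; one per ~9 would mean raising it to about 28-29.)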
-
sech1
to kill VLIW, we need branches every 3-4 instructions, and that's a bit extreme
-
sech1
it's better to increase energy burned from instruction execution (FMUL_R is the best), because you can't avoid executing it in any CPU design. VLIW only saves energy on instruction scheduling, not instruction execution
-
tevador
Btw, I managed to trim the risc-v FDIV_M isn down to 56 bytes, which is quite impressive for scalar code. ARM needs 48 bytes with vector code. I'm estimating that the risc-v vector equivalent will be 32 bytes, same as x86.