-
paulio_uk
hmm I might grab a beagleboard for testing, not much ram, but enough to play with
-
paulio_uk
it's the C910, not the C920, but worth a play
-
elucidator
new beagleboards have riscv variant? or are you saying this for aarch64?
-
elucidator
I always wondered if the PRUs on the beagleboards' Sitara can be used for "accelerating" stuff, but they are too slow at a mere 200 MHz and there are only two of them. their best feature must be having "all instructions are single cycle"
-
elucidator
plus "each instruction is single cycle" only holds until you realize some opcodes are pseudo-opcodes that assemble to multiple real opcodes, so it's not truly single-cycle operation. but at least each real opcode is still single cycle as far as my work showed.
-
paulio_uk
-
paulio_uk
Processor T-Head TH1520 (quad-core Xuantie C910 processor)
-
paulio_uk
Memory 4GB LPDDR4
-
paulio_uk
Storage 16GB eMMC
-
elucidator
oh sweet, thanks
-
elucidator
150 bucks
-
paulio_uk
yeah not too cheap, but I can get it delivered pretty quick
-
paulio_uk
might wait til payday
-
elucidator
tbf beagleboards were never the cheap option. granted they had good software support but kinda have that "branding" price tag. good option for a quality RISC-V product ofc.
-
paulio_uk
2 x 2GB LPDDR4X, 16-bit x 2 channels, 2133 MHz... yeah, not expecting to ROI mining with it :D
-
elucidator
dang, I ordered 1000 for a mining farm already :'(
-
paulio_uk
:D
-
elucidator
it's probably pretty low power tho
-
paulio_uk
you could probably ROI with 1000 of them in a farm... just not mining :D could sell that as a "Homebuilt X6" :D
-
paulio_uk
hmm, has anyone even received their antminer X5 yet? Or are all of the resellers holding onto them?
-
elucidator
says 5V 2A for the beagleboard. less than an rpi4 for similar specs.
-
paulio_uk
hmm thought the pi4 was 5v 2a
-
paulio_uk
hell, I have a couple pi3's running on 500mA (they're not doing very much though)
-
paulio_uk
well 550mA
-
elucidator
its official power brick is 3A
-
paulio_uk
guess that's for when the hdmi output, audio and IO are all in use
-
elucidator
I'm just ballpark comparing the required numbers. otherwise I too am running the older Pis with less than desired amps
-
paulio_uk
meh, decisions decisions... buy the rtlsdr kit I wanted to play with, or a beagleboard
-
paulio_uk
in fairness all I'd do with the beagleboard is get debian up and running on it then open up a vlan'd ssh for sech1 and hyc to play with it
-
paulio_uk
my knowledge is terrible, time would be better spent on it with someone that knows what they're doing :D
-
sech1
If only I had time for all this :D
-
sech1
Right now it's hard to find time even to write aarch64 code for RandomX tweaks
-
paulio_uk
I really need to start trying to do Elon Musk work hours :| there's not enough hours in the day to do everything I need to, let alone want to :|
-
elucidator
bad advice, but in my early 20s I did a month of sleeping only every other night. you get some sweet extra time, but you also get nausea in the mornings whether you sleep or not, and your health gets terrible.
-
paulio_uk
tis bad advice, but heck I've done it plenty and I'm in my 40s now - still occasionally do it.
-
paulio_uk
something about the amount of work I get done through the night almost adds up to 3 days of working during the day
-
sech1
Hmm, AES tweak is only 16 instructions on x86, but 48 instructions on aarch64 (but each aarch64 instruction should be faster because it does less stuff than x86's aesenc)
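-
A minimal sketch of why the counts differ roughly 3:1 (illustrative only, not the code from sech1's branch): x86's aesenc does SubBytes, ShiftRows, MixColumns and the round-key XOR in a single instruction, while aarch64 splits the round into AESE + AESMC and still needs an extra EOR to match aesenc's key handling.

    #if defined(__aarch64__)
    #include <arm_neon.h>   // requires the AES/crypto extension (e.g. -march=armv8-a+crypto)
    // One x86-style AES round emulated on aarch64: 3 instructions.
    static inline uint8x16_t aes_round(uint8x16_t state, uint8x16_t round_key) {
        state = vaeseq_u8(state, vdupq_n_u8(0)); // SubBytes + ShiftRows (key XOR folded in, zero here)
        state = vaesmcq_u8(state);               // MixColumns
        return veorq_u8(state, round_key);       // add the round key afterwards
    }
    #else
    #include <wmmintrin.h>
    // The same round on x86 is a single aesenc.
    static inline __m128i aes_round(__m128i state, __m128i round_key) {
        return _mm_aesenc_si128(state, round_key);
    }
    #endif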
-
m-relay
<polar9669:matrix.org> I still don't understand why we need to tweak pow now? It's still anti-ASIC; why make every miner update and cause hashrate fluctuations again, even botnets won't like it
-
sech1
We will be tweaking PoW anyway, for unrelated reason
-
sech1
-
paulio_uk
meh been wondering what that number was for a while... not sure my VPS can hold out with 40gb/year blockchain increase :P struggling for space as it is
-
paulio_uk
would have to prune it :| I like having a full blockchain hosted
-
sech1
so regarding PoW tweaks: one tweak is for fast partial block verification, the CFROUND and AES tweaks are to make modern CPUs more efficient (not to brick asics). RandomX is 4 years old now, it needs to be tweaked for modern CPUs.
-
paulio_uk
heck, if you can get my 7950x up to 40 kh/s that'd be nice :)
-
sech1
I tested 7950X, it's 7-8% speedup like other Ryzens
-
sech1
so expect +1.5 kh/s
-
sech1
and this increase is while doing more stuff in the RandomX loop at the same time :D
-
sech1
and of course you'll get +1.5 kh/s if your memory is fast enough (you're not bottlenecked by memory)
-
paulio_uk
definitely need to spend some time tuning it better
-
paulio_uk
also need to deal with the secure boot issue in debian :/
-
paulio_uk
can't for the life of me get MSR working - probably just going to need to sign the mod and load it in
-
sech1
-
sech1
and hyc ^
-
sech1
my branch is ready for testing on arm64 CPUs which have AES
-
tevador
Thanks, I'll check it out.
-
sech1
You'll have to add "--v2" to the command line to test RandomX v2
-
tevador
Maybe it's premature optimization, but I wanted to avoid runtime checks of the flags since the VmBase virtual function table already serves this purpose.
-
sech1
flags are checked only 1-2 times per program because CFROUND is rare, and AES is only once per program
-
sech1
templating so much code (the whole class) bloats the code, which is also bad
-
sech1
CFROUND check can be optimized away with several versions of h_CFROUND function
-
sech1
because it's called by pointer anyway, so you can just point to the correct function when initializing JitCompiler::engine
-
tevador
^this is what my work-in-progress solution does
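-
A self-contained sketch of that dispatch idea; only the JitCompiler::engine and h_CFROUND names come from the chat, everything else is simplified and hypothetical rather than the actual work-in-progress code:

    struct Instruction { /* decoded RandomX instruction */ };

    class JitCompiler {
    public:
        using InstructionGen = void (JitCompiler::*)(const Instruction&);
        InstructionGen engine[256];                // opcode byte -> code generator, called via pointer

        void h_CFROUND_v1(const Instruction&) { /* emit legacy CFROUND */ }
        void h_CFROUND_v2(const Instruction&) { /* emit tweaked CFROUND */ }

        // The flag is consulted once here; compilation itself never branches on it.
        void initEngine(bool useV2, int cfroundSlot) {
            engine[cfroundSlot] = useV2 ? &JitCompiler::h_CFROUND_v2
                                        : &JitCompiler::h_CFROUND_v1;
        }
    };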
-
sech1
I think XMRig already does it - different versions for different CPU models
-
sech1
I don't know if it makes sense for CFROUND specifically
-
sech1
It's literally called once per program (on average)
-
sech1
and once per program in JIT compiler too
-
tevador
It would save 24 branches per RandomX hash. Maybe it is premature to optimize it.
-
sech1
It doesn't hurt. Code will get a bit bigger, but only one version of h_CFROUND will ever be called after the initialization, so it won't hurt the CPU code cache
-
sech1
RANDOMX_FLAG_HARD_AES checks don't need to be optimized. They happen only 2 times per program, and they're 100% predictable by branch predictor
-
sech1
-
sech1
tevador hyc I'm experimenting with different program sizes:
paste.debian.net/hidden/36225154
-
sech1
7950X can execute 30% more RandomX instructions per clock with program size = 512
-
sech1
Which means it's not doing anything 30% of the time with current parameters
-
sech1
and CPU idling = CPU wasting power and losing efficiency
-
sech1
I'm sure that X5 has balanced performance between CPU cores and memory, so they never wait for data from memory
-
sech1
Now I'm in favor of increasing program size to increase CPU efficiency
-
sech1
And 7950X idles 30% of the time with overclocked and tuned memory (DDR5-6000 CL30 with tuned sub-timings). It will be even worse with slower memory.
-
sech1
Current RandomX parameters were tuned for old CPUs from 2019. In 4 years, CPUs got 50-70% faster (per thread in RandomX), but memory latency didn't improve.
-
sech1
Something to think about
-
tevador
We'd have to be very careful not to exceed x86 uop cache sizes or power efficiency will go down the drain.
-
sech1
-
sech1
Op cache size increased from 4,096 to 6,750 Ops per core
-
sech1
Zen5 is expected to have 15-20% IPC increase yet again
-
sech1
and it will do it with the same cache sizes, so it will again be a wider and faster core
-
tevador
Light verification time is another concern. We're already increasing it for soft-AES systems by 10%.
-
sech1
Intel CPUs also increase opcache and other caches in each generation
-
sech1
I'm not saying to increase program size to 512 :D
-
sech1
288 or 320 seems more realistic
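-
For reference, the knob being discussed is RANDOMX_PROGRAM_SIZE in RandomX's configuration.h; the 288 below is just the experiment from this conversation, not a decided parameter, and the other values are the upstream defaults as I recall them:

    // configuration.h
    #define RANDOMX_PROGRAM_SIZE         288   // upstream default: 256
    #define RANDOMX_PROGRAM_ITERATIONS   2048  // unchanged
    #define RANDOMX_PROGRAM_COUNT        8     // unchanged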
-
tevador
For psychological reasons, I think it should be possible to increase a bit to compensate for the hashrate boost of AMD CPUs.
-
tevador
Miners don't like the "number to go down".
-
sech1
288 (v2) is still a bit faster than 256 (v1) on 7950X
-
sech1
-
sech1
v1 is 1635 h/s
-
tevador
What is the isn/s increase from 256 to 288?
-
sech1
+7.2%
-
tevador
You might have to subtract the time to run RandomX without any instructions and dataset reads.
-
sech1
v1 is 1635 h/s = 6.857e9 ins/s, v2 (288) is 1670.5 h/s = 7.882e9 ins/s, so overall almost 15% increase
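-
Those figures check out against the default 8 programs x 2048 iterations per hash:
    v1 (256):  8 * 2048 * 256 = 4,194,304 ins/hash; 4,194,304 * 1635 h/s   ≈ 6.86e9 ins/s
    v2 (288):  8 * 2048 * 288 = 4,718,592 ins/hash; 4,718,592 * 1670.5 h/s ≈ 7.88e9 ins/s
    7.88e9 / 6.86e9 ≈ 1.15, i.e. the ~15% quoted above.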
-
tevador
i think I tested this one time but forgot the numbers.
-
sech1
IIRC the actual RandomX instructions take more than 90% of the hash time
-
sech1
If we assume the same 90% for both versions, the relative difference will stay the same
-
sech1
Because they have similar hashrates, and non-RandomX stuff is exactly the same in both versions
-
sech1
Which means 7.882/6.857 will turn into (7.882/0.9)/(6.857/0.9) and the end result is the same: 15% increase. If my math is correct
-
sech1
Hmm, I think it's not correct :D
-
tevador
No, it will be a bit less. It's like you are increasing from ~280 to ~312 instead of from 256 to 288.
-
sech1
from 256 to 288 was a 7.2% increase. I just added the increase from CFROUND to get the final 15%
-
sech1
the 7.2% increase is if both are RandomX v2
-
sech1
T(hash) = T(fill scratchpad) + T(JIT), right? If 256 (v1) and 288 (v2) have almost the same T(hash), and T(fill scratchpad) is the same because no changes there, then it means T(JIT) is also the same, so we can just do 288/256 and be done :D
-
sech1
288/256 = +12.5%, 312/280 = +11.4%
-
sech1
so we came to the same conclusion by two different methods
-
sech1
I just realized that keeping v2 hashrates the same (or a bit lower) is important to not increase dependency on fast tuned RAM. Higher hashrates require better RAM timings and better tuning.
-
tevador
While we're at it, we could also tweak the instruction frequencies...
-
sech1
I don't have any ideas on what to tweak there yet
-
sech1
Seems pretty balanced
-
tevador
It might be a good idea to have benchmarks of programs filled entirely with one type of instruction.
-
sech1
I would increase FMUL_R frequency a bit, because it's the most energy-burning instruction where ASICs would have minimal advantage
-
sech1
and reduce IXOR_R frequency proportionally
-
sech1
+4 for FMUL_R, -4 for IXOR_R
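-
In configuration.h terms that experiment would look roughly like the following; the 32 and 15 defaults are the upstream frequencies as I recall them, shown only to illustrate the +/-4 shift:

    // configuration.h - instruction frequencies out of 256 program slots
    #define RANDOMX_FREQ_FMUL_R   36   // upstream default: 32
    #define RANDOMX_FREQ_IXOR_R   11   // upstream default: 15 (keeps the total at 256)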
-
tevador
I'd have to check if we're not getting too many infinities. Btw that also applies to just increasing the isn count per iteration.
-
sech1
yes
-
sech1
+4 for FMUL_R, -4 for IXOR_R even increased hashrate a bit - from 1670 to 1676 h/s (v2 288)
-
tevador
There are probably tighter dependency chains between integer registers compared to floating point.
-
tevador
But on the other hand, one of the audits warned that we don't have enough dependencies in the code.
-
sech1
That's good, we don't want too many dependencies. Future CPUs will get wider and wider cores
-
tevador
With fewer dependency chains, we should probably increase CBRANCH a bit to kill VLIW designs.
-
sech1
Branch is every 10 instructions on average now. If we make it 9, it's still enough space for VLIW
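-
(Assuming the upstream default of RANDOMX_FREQ_CBRANCH = 25 out of 256 slots: 256/25 ≈ one branch per 10.2 instructions; one per ~9 would mean raising it to about 28-29.)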
-
sech1
to kill VLIW, we need branches every 3-4 instructions, and that's a bit extreme
-
sech1
it's better to increase energy burned from instruction execution (FMUL_R is the best), because you can't avoid executing it in any CPU design. VLIW only saves energy on instruction scheduling, not instruction execution
-
tevador
Btw, I managed to trim the risc-v FDIV_M isn down to 56 bytes, which is quite impressive for scalar code. ARM needs 48 bytes with vector code. I'm estimating that the risc-v vector equivalent will be 32 bytes, same as x86.