08:26:10 <l-m> hello all. i am the creator of RandomX.js (https://github.com/l1mey112/randomx.js) and i would like to ask for feedback
08:27:06 <l-m> i am currently in the process of speculating how i can make performance improvements, being that my library is only 5 times slower than the reference implementation. if i can make 10% improvements here and there, i would be happy to reach even 25-35 H/s at most
08:27:18 <l-m> i have an issue tracker here for performance tweaks: https://github.com/l1mey112/randomx.js/issues/1
08:28:36 <l-m> there has not been a webmining implementation of the monero POW in years, would anyone be able to gauge the performance of previous cryptonight POW webminers compared to the hashrate of mine?
09:03:23 <sech1> Cudos for effort, although the biggest performance hit is inability to use the 2GB full dataset mining mode.
09:04:12 <sech1> fused multiply-add instructions can't be used because they don't do rounding between MUL and ADD, so the end result will be different
09:25:45 <l-m> i use FMA instructions for efficient emulation of the differing counting modes. for example you can take a multiplication and subtract the result of that multiplication from itself "without an intermediate round", to get the error in that operation. you can take that error and adjust the final floating point number by branching on the sign of that
09:25:46 <l-m> error term. look up compensated summation/two-sum, two-product, error free transforms. https://indico.cern.ch/event/313684/contributions/1687773/attachments/600513/826490/FPArith-Part2.pdf for an introduction
09:26:58 <l-m> implementations of different rounding modes with FMA are just a couple cycles slower, but without FMA you can still emulate it effectively with a ~10 FP operation overhead, even less on superscalar + out of order machines.
09:27:42 <l-m> directed rounding isn't my problem really, its just the AES. though there is no way to be sure without instrumentation/performance counters which i am working on currently
10:07:49 <sech1> Ah, FMA is for this purpose. Then it should work. I used it for the same purpose in my RandomX OpenCL code, it works and it's deterministic.
10:10:25 <sech1> Software AES gives ~30% slowdown in the native code, it should be probably the same in WebAssembly
10:35:19 <l-m> given this then, im guessing all of the overhead lies in the fact that repeat instantiations of the JIT WASM code 8 times per chained VM execution is the issue.
10:35:35 <l-m> there is just too much overhead performing `new WebAssembly.Module()`
10:36:57 <l-m> there is work done for WebAssembly baseline JITs to make this faster, but in reality the library incurs cost generating the WASM, then allowing the host to generate the native code
10:40:37 <sech1> That 1 h/s per thread estimate was made way back in 2019, WASM was probably in much worse shape back then
10:41:07 <sech1> Or I think the estimate was made for a pure JS interpreter
10:41:21 <l-m> pure JS interpreter most likely
10:42:13 <sech1> There was even an implementation made in 2019 by someone, I remember checking that website and it indeed was less than 1 h/s
10:42:24 <sech1> But they didn't even verify correctness of the hashes
11:02:30 <l-m> i could have seen RandomWOW being much faster than RandomX due to its "light" implementation, but the fact that it uses 16 chained executions means i loose out on all gains
11:05:28 <l-m> i just checked, and on the randomwow branch after adjusting the amount of program iterations from 16 to 8 i achieve 48 H/s
11:05:47 <l-m> reaching true a "light" randomx
12:31:14 <sech1> "there is just too much overhead performing `new WebAssembly.Module()`" can't you keep the instance between RandomX VM executions? I'm not very familiar with WASM
13:04:44 <l-m> sech1 the JIT generates a randomx program in WASM, which needs to be instanciated to be converted to native code, 8 times per chained execution. JIT -> WASM -> new WebAssembly.Module() -> native code. the JS runtime/host needs to perform another step converting the WASM to native code, in a native environment, you can just JIT the native code.
13:05:37 <l-m> to run WASM code, the host needs to compile it some way into native code. there is quite an overhead doing so, you don't get access to native code immediately
13:05:46 <sech1> So it's kind of running a compiler to get native code, every time. Yes, it will be slow.
13:06:00 <l-m> especially if you're only using it once and throwing it away, there is no chance to optimise at all and you'll be left with the subpar baseline JIT
13:06:30 <l-m> https://v8.dev/blog/liftoff - v8 for example would generate subpar code for the randomx VM
16:21:12 <hyc> well yes, that's all to be expected. the point was making each program run only once before chaining to the next one, so none of the first is reusable
18:38:59 <m-relay> <i​udfasjdjf:luc.cat> black lives dont matter nigger!
23:24:26 <l-m> hello again. can someone provide me resources on how to better understand how xmrig works and the stratum protocol? i am looking to reimplement the randomx part of xmrig to run in the browser, and release it as open source software (with a small dev fee, like xmrig)
23:25:15 <l-m> i am quite new to the implementations of cryptocurrency mining, it would be good to understand how block templates work, nonces when mining, im all very new
23:26:46 <l-m> i saw that sech1 is a developer on xmrig, maybe you could give me a small rundown?