-
sech1
hyc you're right, this simple program works fine:
paste.debian.net/hidden/fb9a8aea
-
sech1
and if you comment the second sys_icache_invalidate, it starts to output random numbers in the second printf
-
sech1
I suspect we have an undefined behavior somewhere in the code, and compiler takes "advantage" of it :D
-
sech1
well, ubsan did find something:
paste.debian.net/hidden/70411197
-
sech1
fixing it didn't help
-
sech1
lol, it produces different hashes even if I just run JIT compiler once and then just call the generated program repeatedly :D
-
sech1
yep, the code is generated only once, write protection is on but hashes still change occasionally:
paste.debian.net/hidden/cb3e2e49
-
sech1
magic
-
sech1
I think the hash gets wrong when the thread is rescheduled to another CPU core. If I increase RANDOMX_PROGRAM_ITERATIONS, I get more incorrect hashes. If I decrease it 4 times, all hashes are corrrect
-
sech1
I think it's an unaligned memory read or write somewhere in the JIT code. M1 doesn't segfault on this, it just silently reads garbage :D
-
sech1
hyc selsta "On AArch64 X18 is used as a platform register for some platforms, so to generate portable executables it should not be used by the compiler"
-
sech1
lol
-
sech1
RandomX aarch64 code uses it of course
-
sech1
I guess new macOS SDK uses it too and changes it
-
sech1
I'm not sure what's happening, but it looks like x18 can change right during the program execution. And it's not related to JIT at all.
-
sech1
This program produces different values:
paste.debian.net/hidden/03d0943c (nop instructions are to get correct timings for the bug to happen)
-
sech1
Even without JIT, when I call it directly
-
sech1
If I change x18 to x10 in the program code, everything starts working magically
-
sech1
I was wrong, the previous paste is the working code
-
sech1
-
sech1
-
sech1
find 2 differences :D
-
sech1
Now I have to rewrite the whole aarch64 JIT to not use x18, that's unfortunate
-
selsta
seems Gimp had the same issue when porting to M1 Mac
-
selsta
also interesting that this didn't trigger on any other aarch64 system yet, seems modern Android also has x18 reserved
-
selsta
it was documented here, since we thought it's JIT related we kept looking at the wrong documentation
developer.apple.com/documentation/x…ting-arm64-code-for-apple-platforms
-
sech1
yes, but now I need to rewrite the JIT
-
sech1
It still fails now, but at least it's consistent and in one place. So probably some other bug somewhere in the rewritten code.
-
sech1
*and fails in the same place
-
sech1
github.com/SChernykh/RandomX/tree/fix-x18 - the randomness went away, it always fails at "[85] Hash test 2b (compiler)" now
-
sech1
Probably some RandomX instruction is not fixed properly
-
sech1
But Hash test 2a (compiler) always passes, that's a good sign
-
sech1
I'll come back to it in the evening, exhausted now. Maybe hyc or tevador can spot the bug in the meantime.
-
selsta
also fails at 2b here at 300a0adb47603dedb42228ccb2b211104f4da45af709cd7547cd049e9489c969
-
sech1
-
sech1
It used x18 and x19 before, then I changed it to x20, x21 but it should've been x19, x20
-
sech1
Now it passes all tests
-
m-relay
<plowsof:matrix.org> 8-D
-
sech1
You can make fresh git clone then git checkout fix-18 and try it
-
selsta
All tests PASSED
-
selsta
nice :D
-
sech1
-
sech1
can you try to run the benchmark too?
-
sech1
./randomx-benchmark --mine --init 8 --threads 8 --jit
-
sech1
-
selsta
-
sech1
nice
-
sech1
So it wasn't JIT after all :D
-
selsta
C/C++ CI / build-alpine (aarch64, latest-stable) fails, is this new?
-
sech1
Interesting
-
sech1
It crashed in dataset initialization
-
sech1
But it doesn't crash on M1, weird
-
sech1
wtf
-
sech1
It looks like I PR'd wrong code
-
sech1
no, all good on M1
-
sech1
well, now I have to debug it in qemu because M1 works
-
sech1
selsta I fixed it, all green now
-
selsta
once this is merged 4 issues can be closed
-
selsta
will retest once I'm home
-
hyc
nice work sech1
-
sech1
yeah
-
sech1
but it's weird
-
sech1
asm code saved and restored register x18
-
sech1
it's almost like some kernel interrupt uses it, so it can change in the middle of a function
-
sech1
whatever, if Apple say "no touch" that's what we do :D
-
hyc
;)
-
sech1
my faith in JIT restored
-
hyc
:)
-
selsta
this bug resulted in so much weird behavior lol
-
selsta
even with the old SDK it would crash when started as a detached process
-
selsta
glad all of this is fixed now and I don't have to constantly change the SDK
-
sech1
Apple docs also mention register x29, but it's only used to restore nice callstacks, so we can ignore it
-
sech1
p2pool and xmrig also suffered from it (I found reported issues on Github)
-
hyc
ok, yeah randomx-tests passed here on my M1
-
sech1
randomx-benchmark is more relevant, it computes 1000 hashes
-
sech1
-
hyc
I'm still on older SDK which never crashed before
-
hyc
1000 nonces ran fine here
-
sech1
I tested with newer SDK where it crashed or produced invalid hashes without the fix
-
hyc
but I guess now that we have this nailed I feel safe in upgrading my machine ;)
-
sech1
btw I think we have some unnecessary icache flushes in the code
-
sech1
it flushes cache twice
-
hyc
probably we added extra in prior attempts to kill the bug
-
sech1
The one in setPagesRX should be enough
-
sech1
I'll remove them later in a separate PR
-
sech1
Actually, __builtin___clear_cache in setPagesRX only works on macOS, so __builtin___clear_cache in JIT compiler is still needed :D
-
tevador
so it was a RandomX bug after all
-
sech1
yes
-
sech1
-
sech1
But I didn't pay attention, I though that saving and restoring it before returning is enough
-
sech1
But it's actually a "volatile" register
-
sech1
I wouldn't call it a bug. The code correctly saved and restored x18, it was just not designed to be interrupted by kernel (or whatever else) and x18 being different after the interrupt
-
tevador
-
sech1
Even in this code:
paste.debian.net/hidden/51dbf4fa where x18 is read right after it's written, it can still fail
-
tevador
The kernel has an explicit exception for it
-
hyc
gawd
-
sech1
-
sech1
so it's used in task switching
-
sech1
that explains it
-
sech1
then it's indeed a volatile register from used code's point of view
-
sech1
*user code
-
tevador
yes, so you'll get wrong hashes if there is a context switch in the middle of a program run
-
hyc
if you wrote that register you prob wouldn't survive a context switch
-
sech1
No, that would be a potential kernel exploit :D
-
sech1
Kernel initializes it before using it
-
tevador
it will probably just store some garbage in the register
-
hyc
if all it did was corrupt a hash calc, why did users get SEGVs then
-
sech1
because old RandomX code used x18 in dataset loads
-
tevador
if x18 was used to access memory in RandomX code, it could cause a segfault
-
sech1
x18 was used everywhere, it was a "temporary register"
-
hyc
ah. very temporary.
-
sech1
most places didn't access memory, so they would just cause an invalid hash
-
sech1
but a couple places used x18 for memory loads
-
sech1
you can just check the patch
-
tevador
so if there was a context switch between setting x18 and reading memory, boom
-
tevador
reminds me of my RISC-V trouble with the x4 register, but it's used for a different purpose there
-
sech1
-
sech1
boom
-
hyc
aha
-
selsta
next release we can add both risc-v support and fix macOS JIT
-
hyc
great
-
selsta
I didn't get risc-v gitian working yet but it's ready on RandomX side
-
hyc
since #281 is merged, shouldn't #262 be closed? it didn't auto-close because it wasn't mentioned in commit msg
-
sech1
yes
-
sech1
UB sanitizer also found a couple of unaligned memory reads in aarch64 JIT, but I'll fix it in a separate PR tomorrow
-
sech1
low priority because it's accessing 8 bytes which are only 4-byte aligned. But all current ARM CPUs don't mind it
-
hyc
cool