06:22:14 hyc you're right, this simple program works fine: https://paste.debian.net/hidden/fb9a8aea/ 06:22:39 and if you comment the second sys_icache_invalidate, it starts to output random numbers in the second printf 06:24:48 I suspect we have an undefined behavior somewhere in the code, and compiler takes "advantage" of it :D 06:29:17 well, ubsan did find something: https://paste.debian.net/hidden/70411197/ 06:31:23 fixing it didn't help 07:27:03 lol, it produces different hashes even if I just run JIT compiler once and then just call the generated program repeatedly :D 07:32:35 yep, the code is generated only once, write protection is on but hashes still change occasionally: https://paste.debian.net/hidden/cb3e2e49/ 07:32:37 magic 07:41:18 I think the hash gets wrong when the thread is rescheduled to another CPU core. If I increase RANDOMX_PROGRAM_ITERATIONS, I get more incorrect hashes. If I decrease it 4 times, all hashes are corrrect 11:08:07 I think it's an unaligned memory read or write somewhere in the JIT code. M1 doesn't segfault on this, it just silently reads garbage :D 13:01:29 hyc selsta "On AArch64 X18 is used as a platform register for some platforms, so to generate portable executables it should not be used by the compiler" 13:01:30 lol 13:01:42 RandomX aarch64 code uses it of course 13:01:57 I guess new macOS SDK uses it too and changes it 13:03:19 I'm not sure what's happening, but it looks like x18 can change right during the program execution. And it's not related to JIT at all. 13:04:04 This program produces different values: https://paste.debian.net/hidden/03d0943c/ (nop instructions are to get correct timings for the bug to happen) 13:04:20 Even without JIT, when I call it directly 13:05:00 If I change x18 to x10 in the program code, everything starts working magically 13:06:39 I was wrong, the previous paste is the working code 13:06:41 https://paste.debian.net/hidden/03d0943c/ works 13:06:48 https://paste.debian.net/hidden/51dbf4fa/ has bug 13:06:52 find 2 differences :D 13:07:29 Now I have to rewrite the whole aarch64 JIT to not use x18, that's unfortunate 13:19:05 seems Gimp had the same issue when porting to M1 Mac 13:20:05 also interesting that this didn't trigger on any other aarch64 system yet, seems modern Android also has x18 reserved 13:21:46 it was documented here, since we thought it's JIT related we kept looking at the wrong documentation https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms 13:22:57 yes, but now I need to rewrite the JIT 13:23:34 It still fails now, but at least it's consistent and in one place. So probably some other bug somewhere in the rewritten code. 13:23:44 *and fails in the same place 13:26:48 https://github.com/SChernykh/RandomX/tree/fix-x18 - the randomness went away, it always fails at "[85] Hash test 2b (compiler)" now 13:27:10 Probably some RandomX instruction is not fixed properly 13:27:24 But Hash test 2a (compiler) always passes, that's a good sign 13:29:55 I'll come back to it in the evening, exhausted now. Maybe hyc or tevador can spot the bug in the meantime. 13:39:17 also fails at 2b here at 300a0adb47603dedb42228ccb2b211104f4da45af709cd7547cd049e9489c969 14:17:39 selsta I fixed it: https://github.com/SChernykh/RandomX/blob/fix-x18/src/jit_compiler_a64.cpp#L486C34-L486C34 14:17:58 It used x18 and x19 before, then I changed it to x20, x21 but it should've been x19, x20 14:18:04 Now it passes all tests 14:18:10 8-D 14:18:33 You can make fresh git clone then git checkout fix-18 and try it 14:18:43 All tests PASSED 14:18:45 nice :D 14:20:01 hashes match in benchmark, yay https://paste.debian.net/hidden/ef3cf135/ 14:20:29 can you try to run the benchmark too? 14:20:41 ./randomx-benchmark --mine --init 8 --threads 8 --jit 14:22:23 Finally: https://github.com/tevador/RandomX/pull/281 14:22:54 https://paste.debian.net/hidden/33f756e9/ 14:24:08 nice 14:24:23 So it wasn't JIT after all :D 14:25:51 C/C++ CI / build-alpine (aarch64, latest-stable) fails, is this new? 14:28:58 Interesting 14:29:22 It crashed in dataset initialization 14:32:17 But it doesn't crash on M1, weird 14:35:01 wtf 14:35:12 It looks like I PR'd wrong code 14:36:19 no, all good on M1 14:40:41 well, now I have to debug it in qemu because M1 works 15:25:16 selsta I fixed it, all green now 15:27:20 once this is merged 4 issues can be closed 15:31:13 will retest once I'm home 16:02:30 nice work sech1 16:02:43 yeah 16:02:45 but it's weird 16:02:57 asm code saved and restored register x18 16:03:14 it's almost like some kernel interrupt uses it, so it can change in the middle of a function 16:03:42 whatever, if Apple say "no touch" that's what we do :D 16:03:51 ;) 16:05:07 my faith in JIT restored 16:05:13 :) 16:05:41 this bug resulted in so much weird behavior lol 16:06:19 even with the old SDK it would crash when started as a detached process 16:07:03 glad all of this is fixed now and I don't have to constantly change the SDK 16:07:39 Apple docs also mention register x29, but it's only used to restore nice callstacks, so we can ignore it 16:12:52 p2pool and xmrig also suffered from it (I found reported issues on Github) 16:16:30 ok, yeah randomx-tests passed here on my M1 16:20:44 randomx-benchmark is more relevant, it computes 1000 hashes 16:24:31 tevador https://github.com/tevador/RandomX/pull/281 16:24:46 I'm still on older SDK which never crashed before 16:25:06 1000 nonces ran fine here 16:25:48 I tested with newer SDK where it crashed or produced invalid hashes without the fix 16:27:02 but I guess now that we have this nailed I feel safe in upgrading my machine ;) 16:27:41 btw I think we have some unnecessary icache flushes in the code 16:27:46 it flushes cache twice 16:28:11 probably we added extra in prior attempts to kill the bug 16:28:15 The one in setPagesRX should be enough 16:29:02 I'll remove them later in a separate PR 16:29:46 Actually, __builtin___clear_cache in setPagesRX only works on macOS, so __builtin___clear_cache in JIT compiler is still needed :D 17:05:02 so it was a RandomX bug after all 17:06:07 yes 17:06:30 It's even mentioned here: https://en.wikipedia.org/wiki/Calling_convention#ARM_(A64) 17:06:46 But I didn't pay attention, I though that saving and restoring it before returning is enough 17:06:55 But it's actually a "volatile" register 17:08:04 I wouldn't call it a bug. The code correctly saved and restored x18, it was just not designed to be interrupted by kernel (or whatever else) and x18 being different after the interrupt 17:09:03 btw this explains why it works with SDK < 13: https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/arm64/machine_task.c#L279-L286 17:09:09 Even in this code: https://paste.debian.net/hidden/51dbf4fa/ where x18 is read right after it's written, it can still fail 17:09:11 The kernel has an explicit exception for it 17:09:49 gawd 17:11:05 hmm https://patchwork.kernel.org/project/linux-hardening/patch/20170712144424.19528-8-ard.biesheuvel⊙lo/ 17:11:09 so it's used in task switching 17:11:12 that explains it 17:11:33 then it's indeed a volatile register from used code's point of view 17:11:40 *user code 17:12:09 yes, so you'll get wrong hashes if there is a context switch in the middle of a program run 17:12:47 if you wrote that register you prob wouldn't survive a context switch 17:13:14 No, that would be a potential kernel exploit :D 17:13:28 Kernel initializes it before using it 17:13:35 it will probably just store some garbage in the register 17:13:50 if all it did was corrupt a hash calc, why did users get SEGVs then 17:14:15 because old RandomX code used x18 in dataset loads 17:14:21 if x18 was used to access memory in RandomX code, it could cause a segfault 17:14:38 x18 was used everywhere, it was a "temporary register" 17:14:51 ah. very temporary. 17:15:01 most places didn't access memory, so they would just cause an invalid hash 17:15:09 but a couple places used x18 for memory loads 17:15:16 you can just check the patch 17:15:16 so if there was a context switch between setting x18 and reading memory, boom 17:16:20 reminds me of my RISC-V trouble with the x4 register, but it's used for a different purpose there 17:16:31 yes, here: https://github.com/tevador/RandomX/pull/281/files#diff-7d6ee9862e6ac45d68d215a25c3fad24c846de3b0c30e923ff2819677e929cbaL325 17:16:32 boom 17:20:08 aha 17:35:41 next release we can add both risc-v support and fix macOS JIT 17:36:05 great 17:36:23 I didn't get risc-v gitian working yet but it's ready on RandomX side 20:03:32 since #281 is merged, shouldn't #262 be closed? it didn't auto-close because it wasn't mentioned in commit msg 20:04:18 yes 20:05:23 UB sanitizer also found a couple of unaligned memory reads in aarch64 JIT, but I'll fix it in a separate PR tomorrow 20:06:05 low priority because it's accessing 8 bytes which are only 4-byte aligned. But all current ARM CPUs don't mind it 20:08:42 cool