20:46:55 tevador: Your encouragement of a Crandall prime enables batched point decompression, due to a batch sqrt algorithm existing.
20:47:04 That's quite pleasant to discover :)
20:47:32 I haven't seen it discussed in the literature before, so I'm unsure if it's novel.
20:50:49 It's the first time I've heard about batched sqrt. I only knew about batched inversion.
20:51:11 Is there a paper?
20:58:02 No, I just did it
20:58:15 Basically, instead of adding up to the exponent, you subtract down to it.
20:58:32 256 squares + as many muls as bits set -> 256 squares + as many muls as bits not set
20:58:41 @kayabanerve:matrix.org: weird way to flex but ok
20:58:50 But with a Crandall prime, 192 bits are set and 64 aren't, so the latter is much cheaper
20:59:10 @syntheticbird:monero.social: I was hoping tevador would know of someone else who had already done this 😅
21:00:04 You do need a multiplicative inverse to 'subtract' from the exponent, yet inverses are famously batchable.
21:01:20 Helios and Selene of course have 128 bits set and the other 128 each set with 50% odds; Ed25519's exponent is almost entirely set, making this ~50% faster for decompressing Ed25519 points AFAICT. You basically have the squares, two extra squares, four multiplications, and the amortized batch inverse.
21:02:51 https://github.com/kayabaNerve/monero-oxide/blob/007eb4050ad5deae4085f21b6220e74f7c9ba4cf/monero-oxide%2Fed25519%2Fsrc%2Fhash_to_point%2Fmod.rs#L55-L98 for a very bad implementation. I'm trying to do something stupid and just producing a lot of legitimate research along the way as a side effect.
21:03:00 (so please forgive the code)
21:07:26 *My claim of four multiplications is due to three for the batch inverse and one to perform the 'subtraction'. One extra square is necessary to overshoot the exponent, allowing us to subtract back down to it. The second extra square is nonexistent: while we do need a square for the unset bit and the value to invert, it's folded into the existing squares and isn't additional.
21:08:26 That only gives you a square root candidate though; you still need to check if it's valid, and if not, move to the other candidate for primes of the form 5 mod 8, as with Ed25519 (costs not modeled in my statement above, as the point is the improvement to the exponentiation step).
21:17:54 Tbf, it's really any prime with a significant skew in the distribution of set bits that benefits.
21:17:54 I just noted that while primes with few bits set are inherently faster, primes with most bits set are now also faster _in a batch setting_.
21:58:03 kayabanerve: It only saves about 5 multiplications. Still might be worth it, though. https://gist.github.com/tevador/e5491b90a47f316e75b6d9086ae57c6f
21:58:56 output from this tool: https://github.com/mmcloughlin/addchain
22:00:18 Ah. I wasn't comparing to an addition chain :/ Do you happen to know what the chains for Helios/Selene would be, and if those would still benefit from such a technique?
22:00:54 (Apologies for not just running the tool myself, I stepped away from my computer)
22:01:38 Tbf, the existing Helios and Selene code uses this premise for the top 128 bits, so I should've been aware.
22:02:19 And an addition chain is probably sufficient to capture the optimizable value in this structure...
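(A minimal, self-contained sketch of the subtractive exponentiation described above, using a toy 61-bit Mersenne prime in place of 2^255 - 19 and a mostly-set exponent in place of (p+3)/8; the names and the batch-inversion helper are illustrative, not the monero-oxide code linked above.)

```rust
const P: u128 = (1 << 61) - 1; // toy prime standing in for the real field

fn mul(a: u128, b: u128) -> u128 {
    (a * b) % P
}

/// Square-and-multiply: one squaring per bit, one mul per *set* bit of `exp`.
fn pow(mut base: u128, mut exp: u128) -> u128 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

fn inv(a: u128) -> u128 {
    pow(a, P - 2) // Fermat inversion
}

/// Montgomery's trick: n inverses for the price of one inversion plus
/// 3(n-1) muls, which is why inverses are "famously batchable".
fn batch_inv(xs: &[u128]) -> Vec<u128> {
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1;
    for &x in xs {
        prefix.push(acc); // product of all elements before x
        acc = mul(acc, x);
    }
    let mut inv_acc = inv(acc); // the only full inversion
    let mut out = vec![0; xs.len()];
    for i in (0..xs.len()).rev() {
        out[i] = mul(inv_acc, prefix[i]);
        inv_acc = mul(inv_acc, xs[i]);
    }
    out
}

fn main() {
    let xs = [123_456_789u128, 987_654_321, 55_555_555];
    let e = P - 5; // 59 of its 61 bits are set
    let k = 61; // smallest k with 2^k >= e: the one "extra" square
    let d = (1u128 << k) - e; // sparse: ~one set bit per unset bit of e

    // Collect the values to invert (x^d per element) and batch them.
    let invs = batch_inv(&xs.map(|x| pow(x, d)));
    for (x, x_d_inv) in xs.iter().zip(invs) {
        // x^e = x^(2^k) * (x^d)^-1: k squarings, a mul per unset bit of e,
        // one mul for the 'subtraction', and the amortized batch inverse.
        let subtractive = mul(pow(*x, 1 << k), x_d_inv);
        assert_eq!(subtractive, pow(*x, e));
    }
}
```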
22:05:18 In make_carrot_uncontextualized_shared_key_sender, given that the following conditions apply: address_view_pubkey has been checked (is in the prime-order subgroup),
22:05:20 and enote_ephemeral_privkey is produced mod l (as make_carrot_enote_ephemeral_privkey does: // k_e = H_64(anchor_norm, input_context, K^j_s, pid) mod l),
22:05:28 The one distinction is that the addition chain optimizes for repeated patterns of bits, while this optimizes for a skew in the distribution? So these are doing two separate things, even though there's a correlation between a skewed distribution and repeated patterns.
22:05:48 instead of doing s_sr = d_e * ConvertPointE(K^j_v) (X25519 unclamped scalar mul after conversion),
22:05:54 would doing s_sr = ConvertPointE(d_e * K^j_v) be equivalent?
22:08:06 The latter is vastly faster on current implementations, given I use a vartime edwards25519 scalar mul there.
22:10:09 This one is for Selene: https://gist.githubusercontent.com/tevador/e5491b90a47f316e75b6d9086ae57c6f/raw/8d070f679d658f57e672bff6608dc7443affbff8/addchain_selene.txt
22:13:42 DataHoarder: you shouldn't use vartime scalar multiplication with a secret exponent.
22:13:51 the exponent is public in this case
22:14:10 (I'm making specializations for P2Pool)
22:14:58 In that case, yes, those should give equivalent results.
22:15:05 Ouch. I think my poke at Selene was much worse than that chain. Such a shame :/
22:15:26 Alright, thanks. I have it tested, but wanted some confirmation (will add notes under it).
22:16:44 Btw, for p2pool, is there a reason to use a scalar other than 1?
22:17:17 At least for Carrot, the context ensures the keys will always be unique.
22:17:31 This scalar is generated from make_carrot_enote_ephemeral_privkey, which is a hash
22:17:55 which gets verified later on when scanning, afaik
22:18:50 yep, gets recreated under verify_carrot_normal_janus_protection
22:19:01 Janus protection should only be needed when mining to a subaddress, which is not supported?
22:20:05 it gets checked regardless
22:21:07 It'd be great if we could mine to subaddresses (given the additional pubkeys need to be included already), but AFAIK that couldn't be properly protected, as the amount is not encrypted.
22:21:54 Btw, for Jamtis, P2Pool will probably need to fake the CSIDH key because it would be too slow to generate >100 shared secrets per block.
22:22:24 Tbh this is not the expensive part for P2Pool; get_output_proposal_parts is, because it depends on the amount, which is random.
22:22:31 Luckily, if you set the CSIDH pubkey to all zeroes, the shared secret is equal to the address key.
22:22:35 On p2pool we are able to calculate quite a few things ahead of time.
22:22:58 Well, if you want Jamtis feedback, I can look at implementing whatever you have :D
22:23:39 With 700 miner outputs, just CSIDH would take ~20 seconds.
22:23:57 It's worse as well, tevador, as whatever needs to be generated for one block needs to be recalculated every 5-10 s in P2Pool.
22:24:39 With FCMP++/Carrot we can cache everything up to get_output_proposal_parts, which itself cannot be cached due to depending on the amount (even though it uses a fixed scalar of 1 for the blinding key).
22:24:45 Correction: CSIDH-1024 would take ~2 minutes.
22:25:29 maybe we could have CSIDH-PoW :)
22:25:44 Is that not why we have threads? /s
22:26:04 DataHoarder: The proof of correctness would be nontrivial.
22:27:02 Actually, there is an isogeny-based hash function.
22:28:10 ... but PoW in the form of submitting CSIDH ephemeral keys for real-world use?
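(A sketch of the equivalence confirmed at 22:14:58 above, assuming curve25519-dalek's API: `d_e` stands in for enote_ephemeral_privkey, the Edwards point for address_view_pubkey, and `to_montgomery()` plays the role of ConvertPointE. The equivalence relies on exactly the stated conditions, a scalar already reduced mod l and a torsion-free point, since the X25519 side is the unclamped ladder.)

```rust
use curve25519_dalek::constants::ED25519_BASEPOINT_POINT;
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::montgomery::MontgomeryPoint;
use curve25519_dalek::scalar::Scalar;

/// Fast path: (vartime-capable) edwards25519 scalar mul, then convert.
fn shared_key_mul_then_convert(d_e: &Scalar, k_view: &EdwardsPoint) -> MontgomeryPoint {
    (d_e * k_view).to_montgomery()
}

/// Reference path: convert first, then the unclamped X25519 scalar mul.
fn shared_key_convert_then_mul(d_e: &Scalar, k_view: &EdwardsPoint) -> MontgomeryPoint {
    d_e * k_view.to_montgomery()
}

fn main() {
    // Illustrative inputs: a scalar already reduced mod l and a
    // prime-order-subgroup (torsion-free) view pubkey.
    let d_e = Scalar::from(123_456_789u64);
    let address_view_pubkey = Scalar::from(42u64) * ED25519_BASEPOINT_POINT;
    assert_eq!(
        shared_key_mul_then_convert(&d_e, &address_view_pubkey),
        shared_key_convert_then_mul(&d_e, &address_view_pubkey),
    );
}
```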
22:28:36 (so as to avoid shimming with zeroes)
22:29:50 At least for p2pool, we can do the torsion checks plus everything else ahead of time, when we receive the address pubkeys in the sidechain.
22:30:03 Then it's only good data into the PPLNS window, so no more expensive checks need to be done.
22:31:21 Generating the outputs does, however, delay share verification a lot before broadcast, so we might opt to defer verification of that (only verifying addresses/rewards), then broadcast, then verify (but we can't ban on this).
22:31:59 PoW is checked prior to any of these steps, so it's not a free DoS even when they send bad outputs.
22:35:17 The zero pubkey would only be allowed for coinbase transactions. We don't want lazy wallet developers to nullify the post-quantum security of the protocol.
22:35:52 Scanning coinbase is already a special case, I guess.
22:36:21 That'd screw with Serai, who may also poke at that if it's truly so computationally heavy.
22:36:44 ^ lazy developer spotted
22:36:50 For Carrot it's good enough, aside from, as mentioned, the amount :)
22:37:00 But my point was, defining PoW as performing ephemeral key generation would require a proof the key was correctly generated, which is presumably non-trivial.
22:37:11 No, because I suggested using threading :p
22:37:45 But if we did have an enormous number of TXs requested, I'd consider deferring to 0'd ephemeral keys if the backlog exceeded some time delay.
22:37:53 Which changes the one-time address calculation; it's three scalar mults + three point additions
22:38:14 per output, so 700× that on an RPi 5 ends up at around 14 ms, if I remember correctly.
22:38:54 The only reason for Serai not to use zero is so such TXs don't stick out to those not already aware, which is a goal we have.
22:39:32 The CGL hash simply hashes the data to generate the secret isogeny. So it is possible to verify it.
22:40:19 And the bits select the isogeny path, for example -1/+1 depending on the bit.
22:40:34 Tbh it wouldn't be "that" bad if there were a faster way to verify the outputs were calculated properly, but afaik the best way is to ... recalculate them, lacking key information other than the anchor randomness and receiver pubkeys.
22:41:45 For p2pool, the only viable way is the zero key. Or having 64 CPU cores :)
22:42:01 Yeah, I was looking at the numbers going "hey, that's not that bad"
22:42:10 then ofc I am running the tests on a 9900X3D
22:44:29 For #151, the remaining issue to discuss is whether adding 13 ms to transaction verification time is OK. The rest seems to be clearly in favor of CSIDH-1024.
23:07:31 But if 0'd ephemeral keys (or a single shared one?) are acceptable for coinbase, that'd more or less bring p2pool to exactly the point it's at nowadays, except the CPU time and template switch delay are mostly unchanged (ephemeral keys are cacheable per block; one-time addresses are not, due to the inclusion of the amount).
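(A toy sketch of the zero-pubkey rule from 22:22:31 and 22:35:17 above. The types and `csidh_action` are illustrative stand-ins, not Jamtis' actual API; XOR merely mimics the commutativity of the real group action. The idea: an all-zero pubkey encodes the base curve E_0, so the recipient's view-key action on it lands on their own address key, and no sender-side action is needed.)

```rust
// Hypothetical names throughout; real CSIDH-1024 keys are far larger.
#[derive(Clone, Copy, PartialEq, Debug)]
struct CsidhPublicKey([u8; 8]);

struct CsidhPrivateKey([u8; 8]);

const BASE_CURVE: CsidhPublicKey = CsidhPublicKey([0; 8]); // "all zeroes"

// Toy stand-in for the ~170 ms CSIDH-1024 group-action evaluation.
fn csidh_action(sk: &CsidhPrivateKey, pk: &CsidhPublicKey) -> CsidhPublicKey {
    let mut out = pk.0;
    for (o, s) in out.iter_mut().zip(sk.0) {
        *o ^= s;
    }
    CsidhPublicKey(out)
}

fn recipient_shared_secret(
    ephemeral_pubkey: &CsidhPublicKey,
    view_key: &CsidhPrivateKey,
    is_coinbase: bool,
) -> Option<CsidhPublicKey> {
    if *ephemeral_pubkey == BASE_CURVE && !is_coinbase {
        return None; // lazy wallets don't get to nullify PQ security
    }
    // Acting on E_0 reproduces the recipient's own pubkey, so a zeroed
    // ephemeral key makes the shared secret equal the address key.
    Some(csidh_action(view_key, ephemeral_pubkey))
}

fn main() {
    let view_key = CsidhPrivateKey([7; 8]);
    let address_key = csidh_action(&view_key, &BASE_CURVE);
    // Coinbase fast path: no per-output action needed by the sender.
    assert_eq!(
        recipient_shared_secret(&BASE_CURVE, &view_key, true),
        Some(address_key)
    );
    // The same zeroed key in a non-coinbase transaction is rejected.
    assert_eq!(recipient_shared_secret(&BASE_CURVE, &view_key, false), None);
}
```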