-
br-m
<kayabanerve:matrix.org> tevador: Your encouragement of a Crandall prime enables a batched point decompression due to a batch sqrt algorithm existing.
-
br-m
<kayabanerve:matrix.org> That's quite pleasant to discover :)
-
br-m
<kayabanerve:matrix.org> I haven't seen it discussed in the literature before, so I'm unsure if it's novel.
-
tevador
It's the first time I hear about batched sqrt. I only knew about batched inversion.
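[Editor's note: for reference, the batched inversion tevador mentions is Montgomery's trick, which trades n field inversions for one inversion plus roughly three multiplications per element. A minimal Python sketch (names and values are illustrative, not from any codebase discussed here):]

```python
def batch_invert(xs, p):
    # Montgomery's trick: one inversion + ~3(n-1) muls for n inverses.
    n = len(xs)
    prefix = [1] * n
    acc = 1
    for i, x in enumerate(xs):
        prefix[i] = acc          # prefix[i] = xs[0] * ... * xs[i-1]
        acc = acc * x % p
    inv = pow(acc, -1, p)        # the single inversion, of the full product
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = inv * prefix[i] % p   # = 1 / xs[i]
        inv = inv * xs[i] % p          # peel xs[i] off the running inverse
    return out

p = 2**255 - 19
xs = [3, 7, 12345]
assert batch_invert(xs, p) == [pow(x, -1, p) for x in xs]
```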
-
tevador
Is there a paper?
-
br-m
<kayabanerve:matrix.org> No, I just did it
-
br-m
<kayabanerve:matrix.org> Basically, instead of adding to the modulus, you subtract from the modulus.
-
br-m
<kayabanerve:matrix.org> 256 squares + as many muls as bits set -> 256 squares + as many muls as bits not set
-
br-m
<syntheticbird> @kayabanerve:matrix.org: weird way to flex but ok
-
br-m
<kayabanerve:matrix.org> But with a Crandall prime, 192 bits are set, and 64 aren't, so the latter is much cheaper
-
br-m
<kayabanerve:matrix.org> @syntheticbird:monero.social: I was hoping tevador would have an idea of someone else who already did this 😅
-
br-m
<kayabanerve:matrix.org> You do need a multiplicative inverse to 'subtract' from the *exponent* (sorry, not "modulus", I meant to say instead of adding up to the exponent, you subtract down to it), yet inverses are famously batchable.
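[Editor's note: a sketch of the trick as described, with all names being illustrative rather than taken from the monero-oxide code. To compute x^e mod p when e has mostly-set bits, square up past e to x^(2^n), then divide by x^(2^n − e). Since 2^n − e is the bitwise complement of e (within n bits) plus one, the divisor costs one multiplication per *unset* bit of e, and the single inversion can be amortized across a batch:]

```python
def pow_by_complement(x, e, p, n):
    # Compute x^e mod p (x invertible, n >= e.bit_length()) using
    # n squarings plus one multiplication per UNSET bit of e.
    acc = x
    divisor = 1                      # collects x^(2^i) for each unset bit i
    for i in range(n):
        if not (e >> i) & 1:
            divisor = divisor * acc % p
        acc = acc * acc % p          # acc = x^(2^(i+1))
    # acc = x^(2^n); since 2^n - e = ~e + 1, x^e = x^(2^n) / (x^(~e) * x)
    divisor = divisor * x % p        # the "+1" term
    return acc * pow(divisor, -1, p) % p   # in practice, a batched inverse

p = 2**255 - 19
e = (p + 3) // 8                     # sqrt-candidate exponent for p = 5 mod 8
assert pow_by_complement(9, e, p, 255) == pow(9, e, p)
```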
-
br-m
<kayabanerve:matrix.org> Helios and Selene of course have 128 bits set and 128 with 50% odds each. Ed25519's exponent is almost entirely set, making this ~50% faster for decompressing Ed25519 points AFAICT. You basically have the squares, two extra squares, four multiplications, and the amortized batch inverse.
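[Editor's note: a rough operation count backing the ~50% figure for Ed25519's sqrt-candidate exponent. The "+4" term for the amortized batch-inverse overhead is taken from the four-multiplications claim in this conversation, not measured:]

```python
p = 2**255 - 19
e = (p + 3) // 8                 # Ed25519's sqrt-candidate exponent: 2^252 - 2
n = e.bit_length()               # 252
ones = bin(e).count("1")         # 251 of the 252 bits are set
square_and_multiply = (n - 1) + (ones - 1)   # squares + set-bit muls: 501
complement_trick = n + (n - ones) + 4        # squares + unset-bit muls + ~4: 257
assert complement_trick < 0.55 * square_and_multiply
```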
-
br-m
<kayabanerve:matrix.org>
github.com/kayabaNerve/monero-oxide…rc%2Fhash_to_point%2Fmod.rs#L55-L98 for a very bad implementation. I'm trying to do something stupid and just producing a lot of legitimate research on the way as a side effect.
-
br-m
<kayabanerve:matrix.org> (so please forgive the code)
-
br-m
<kayabanerve:matrix.org> *my claim of four multiplications is due to the three for the batch inverse, and one to perform the 'subtraction'. One extra square is necessary to overshoot the exponent, allowing us to subtract from it. The second extra square is nonexistent, as while we do need a square for the unset bit and value to invert, it's folded into the existing squares and isn't additional.
-
br-m
<kayabanerve:matrix.org> That only gives you a square root candidate though, and you still need to check if it's valid and, if not, move to the other candidate for primes of the form 5 mod 8, as seen here with Ed25519 (costs not modelled in my above statement, as the point is the improvement to the exponentiation step).
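[Editor's note: a sketch of that fix-up step for p ≡ 5 (mod 8), where the candidate from the exponentiation may be off by a factor of sqrt(−1). Generic Python, not the monero-oxide implementation:]

```python
def sqrt_5mod8(x, p):
    # For p ≡ 5 (mod 8): one exponentiation yields a candidate root.
    r = pow(x, (p + 3) // 8, p)
    if r * r % p == x % p:
        return r
    # 2 is a non-residue whenever p ≡ 5 (mod 8), so this is a sqrt of -1.
    r = r * pow(2, (p - 1) // 4, p) % p
    if r * r % p == x % p:
        return r
    return None                  # x is not a quadratic residue

p = 2**255 - 19
r = sqrt_5mod8(4, p)
assert r is not None and r * r % p == 4
assert sqrt_5mod8(2, p) is None  # 2 is a non-residue mod 2^255 - 19
```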
-
br-m
<kayabanerve:matrix.org> Tbf, it's really any prime with a significant skew in the distribution of how many bits are set that so benefits.
-
br-m
<kayabanerve:matrix.org> I just noted that while primes with few bits set are inherently faster, primes with most bits set are now also faster _in a batch setting_.
-
tevador
kayabanerve: It only saves about 5 multiplications. Still might be worth it, though.
gist.github.com/tevador/e5491b90a47f316e75b6d9086ae57c6f
-
br-m
<kayabanerve:matrix.org> Ah. I wasn't comparing to an addition chain :/ Do you happen to know what chains for Helios/Selene would be and if those would still benefit from such a technique?
-
br-m
<kayabanerve:matrix.org> (Apologies for not just running the tool myself, I stepped away from my computer)
-
br-m
<kayabanerve:matrix.org> Tbf, the existing Helios and Selene code uses this premise for the top 128 bits, so I should've been aware.
-
br-m
<kayabanerve:matrix.org> And an addition chain is probably sufficient to capture the optimizable value in this structure...
-
DataHoarder
in make_carrot_uncontextualized_shared_key_sender given that the following conditions apply, address_view_pubkey has been checked (is in prime order subgroup),
-
DataHoarder
and enote_ephemeral_privkey is produced mod l (as make_carrot_enote_ephemeral_privkey does, // k_e = (H_64(anchor_norm, input_context, K^j_s, pid)) mod l)
-
br-m
<kayabanerve:matrix.org> The one distinction is the addition chain optimizes for repeated patterns of bits, while this optimizes for a skew in the distribution? So these are doing two separate things, even though there's a correlation between a skewed distribution and repeated patterns.
-
DataHoarder
instead of doing s_sr = d_e ConvertPointE(K^j_v) (X25519 unclamped scalar mul after conversion)
-
DataHoarder
would doing s_sr = ConvertPointE(d_e * K^j_v) be equivalent?
-
DataHoarder
the latter is vastly faster on current implementations, given I use a vartime edwards25519 scalar mul there
-
tevador
DataHoarder: you shouldn't use vartime scalar multiplication with a secret exponent.
-
DataHoarder
the exponent is public in this case
-
DataHoarder
(I'm making specializations for P2Pool)
-
tevador
In that case, yes, those should give equivalent results.
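[Editor's note: the equivalence holds because u = (1 + y)/(1 − y) is the birational map from Ed25519 to Curve25519, and it commutes with scalar multiplication. A toy self-check of that claim, assuming a torsion-free point and a nonzero public scalar; names are illustrative and this is not the Carrot/P2Pool code:]

```python
p = 2**255 - 19
d_ed = -121665 * pow(121666, -1, p) % p  # Ed25519's curve constant d

def ed_add(P, Q):
    # complete addition on -x^2 + y^2 = 1 + d*x^2*y^2
    (x1, y1), (x2, y2) = P, Q
    t = d_ed * x1 % p * x2 % p * y1 % p * y2 % p
    x3 = (x1 * y2 + y1 * x2) * pow(1 + t, -1, p) % p
    y3 = (y1 * y2 + x1 * x2) * pow(1 - t, -1, p) % p
    return (x3, y3)

def ed_mul(k, P):
    # plain double-and-add (vartime, fine for a public scalar)
    R = (0, 1)
    for i in reversed(range(k.bit_length())):
        R = ed_add(R, R)
        if (k >> i) & 1:
            R = ed_add(R, P)
    return R

def to_montgomery_u(P):
    # birational map Ed25519 -> Curve25519: u = (1 + y) / (1 - y)
    return (1 + P[1]) * pow(1 - P[1], -1, p) % p

def mont_ladder(k, u):
    # unclamped Montgomery ladder on u-coordinates (RFC 7748 formulas)
    a24 = 121665
    x2, z2, x3, z3 = 1, 0, u, 1
    for i in reversed(range(k.bit_length())):
        bit = (k >> i) & 1
        if bit:
            x2, z2, x3, z3 = x3, z3, x2, z2
        A, B = (x2 + z2) % p, (x2 - z2) % p
        AA, BB = A * A % p, B * B % p
        E = (AA - BB) % p
        C, D = (x3 + z3) % p, (x3 - z3) % p
        DA, CB = D * A % p, C * B % p
        x3, z3 = (DA + CB) ** 2 % p, u * (DA - CB) ** 2 % p
        x2, z2 = AA * BB % p, E * (AA + a24 * E) % p
        if bit:
            x2, z2, x3, z3 = x3, z3, x2, z2
    return x2 * pow(z2, -1, p) % p

# recover the standard base point from y = 4/5 (either sign of x works here)
By = 4 * pow(5, -1, p) % p
xx = (By * By - 1) * pow(d_ed * By * By + 1, -1, p) % p
Bx = pow(xx, (p + 3) // 8, p)
if Bx * Bx % p != xx:
    Bx = Bx * pow(2, (p - 1) // 4, p) % p  # multiply by sqrt(-1)
B = (Bx, By)

d_e = 0x1234567890ABCDEF  # stand-in for the public ephemeral scalar
assert to_montgomery_u(ed_mul(d_e, B)) == mont_ladder(d_e, to_montgomery_u(B))
```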
-
br-m
<kayabanerve:matrix.org> Ouch. I think my poke at Selene was much worse than that chain. Such a shame :/
-
DataHoarder
alright, thanks. I have it tested but wanted some confirmation (will add notes under it)
-
tevador
Btw, for p2pool, is there a reason to use a scalar other than 1?
-
tevador
At least for Carrot, the context ensures the keys will always be unique.
-
DataHoarder
this scalar is generated from make_carrot_enote_ephemeral_privkey which is a hash
-
DataHoarder
which gets verified later on when scanning, afaik
-
DataHoarder
yep, gets recreated under verify_carrot_normal_janus_protection
-
tevador
Janus protection should only be needed when mining to a subaddress, which is not supported?
-
DataHoarder
it gets checked regardless
-
DataHoarder
it'd be great if we could mine to subaddresses (given the additional pubkeys need to be included already), but AFAIK that could not get properly protected, as the amount was not encrypted
-
tevador
Btw, for Jamtis, P2Pool will probably need to fake the CSIDH key because it would be too slow to generate >100 shared secrets per block.
-
DataHoarder
tbh this is not the expensive part for P2Pool; the expensive part is get_output_proposal_parts, because it depends on the amount, which is random
-
tevador
Luckily, if you set the CSIDH pubkey to all zeroes, the shared secret is equal to the address key.
-
DataHoarder
on p2pool we are able to calculate quite a bit of things ahead of time
-
DataHoarder
well if you want jamtis feedback I can look at implementing whatever you have :D
-
tevador
With 700 miner outputs, just CSIDH would take ~20 seconds.
-
DataHoarder
it's worse as well tevador, as whatever needs to be generated for one block needs to be recalculated every 5-10s in P2Pool
-
DataHoarder
with FCMP++/carrot we can cache everything up to get_output_proposal_parts, which itself can't be cached due to it depending on the amount (even while it uses a fixed scalar of 1 for the blinding key)
-
tevador
Correction: CSIDH-1024 would take ~2 minutes.
-
DataHoarder
maybe we could have CSIDH-pow :)
-
br-m
<kayabanerve:matrix.org> Is that not why we have threads? /s
-
br-m
<kayabanerve:matrix.org> DataHoarder: The proof of correctness would be nontrivial.
-
tevador
Actually, there is an isogeny based hash function.
-
br-m
<kayabanerve:matrix.org> ... but PoW in the form of submitting CSIDH ephemeral keys for real-world use?
-
br-m
<kayabanerve:matrix.org> (as to avoid shimming with zeroes)
-
DataHoarder
at least for p2pool we can do the torsion checks plus everything else ahead of time, when we receive the address pubkeys in the sidechain
-
DataHoarder
then it's only good data into the PPLNS window, so no more expensive checks need to be done
-
DataHoarder
generating the outputs is however delaying share verification a lot before broadcast, so we might opt for delaying verification of that (and only verifying addresses/rewards), then broadcast, then verify (but we can't ban on this)
-
DataHoarder
PoW is done previous to any of these steps - so it's not a free DoS even when they send bad outputs
-
tevador
The zero pubkey would only be allowed for coinbase transactions. We don't want lazy wallet developers to nullify post quantum security of the protocol.
-
DataHoarder
scanning coinbase is already a special case, I guess
-
br-m
<kayabanerve:matrix.org> That'd screw with Serai, who may also poke at that if truly so computationally heavy.
-
tevador
^ lazy developer spotted
-
DataHoarder
for carrot it's good enough, besides, as mentioned, the amount :)
-
br-m
<kayabanerve:matrix.org> But my point was, defining PoW as performing ephemeral key generation would presumably require a proof the key was correctly generated, which is presumably non-trivial.
-
br-m
<kayabanerve:matrix.org> No, because I suggested using threading :p
-
br-m
<kayabanerve:matrix.org> But if we did have an enormous amount of TXs requested, I'd consider deferring to 0'd ephemeral keys if the backlog exceeds some time delay.
-
DataHoarder
which changes the one-time address calculation; it's three scalar mults + three point additions
-
DataHoarder
per output, so 700* that on an RPi 5 ends up at around 14ms if I remember
-
br-m
<kayabanerve:matrix.org> The only reason for serai to not use zero is so such TXs don't stick out to those not already aware, which is a goal we have
-
tevador
The CGL hash simply hashes the data to generate the secret isogeny. So it is possible to verify it.
-
tevador
And the bits select the isogeny path, for example -1/+1 depending on the bit.
-
DataHoarder
tbh it wouldn't be "that" bad if there was a faster way to verify the outputs were calculated properly, but afaik the best way is to ... recalculate them, lacking key information other than anchor randomness and receiver pubkeys
-
tevador
For p2pool, the only viable way is the zero key. Or having 64 CPU cores :)
-
DataHoarder
yeah I was looking at numbers "hey that's not that bad"
-
DataHoarder
then ofc I am running tests on 9900X3D
-
tevador
For #151, the remaining issue to discuss is if adding 13 ms to transactions verification time is OK. The rest seems to be clearly in favor of CSIDH-1024.
-
DataHoarder
but if for coinbase 0'd ephemeral keys (or a single shared one?) are acceptable, that'd more or less bring p2pool to exactly the point it is at nowadays, except the cpu time and template switch delay are mostly unchanged (ephemeral keys are cacheable per block, one-time addresses are not due to inclusion of the amount)