-
m-relay<kayabanerve:matrix.org> Tritonn's Helioselene contest submission included a binary GCD implementation (github.com/Tritonn204/fcmp-plus-plu…elene-contest-src/src/field.rs#L640) for its inversion function, which scored ~20% faster than the reference implementation's inversion function. I attempted to apply it to lederstrumpf's submission and observed *an 18% drop in performance*. While I've since spent several hours providing an optimized implementation of the binary GCD on top of lederstrumpf's submission (github.com/kayabaNerve/fcmp-plus-pl…rypto/helioselene/src/field.rs#L448), achieving more than a 20% improvement in performance, there's the question of why the out-of-the-box implementation had such a degradation. Ideally, we would have captured both the 20% benefit observed with Tritonn's implementation _and_ all the further improvements I sought.
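-
For context, a minimal sketch of the binary extended GCD inversion technique under discussion, in its textbook single-limb form. The function name, the use of u64 (rather than a 4×64-bit U256), and the bounds are illustrative assumptions; this is not either submission's actual field code.

```rust
/// Sketch: binary extended GCD inversion in F_p, single-limb form.
/// Computes a^-1 mod p for an odd prime p < 2^63 (so x + p cannot overflow)
/// and 0 < a < p. The contest code runs the same loop over 256-bit limbs.
fn binary_gcd_invert(a: u64, p: u64) -> u64 {
    debug_assert!(p % 2 == 1 && a != 0 && a < p);
    let (mut u, mut v) = (a, p);
    // Invariants: x1 * a ≡ u (mod p) and x2 * a ≡ v (mod p).
    let (mut x1, mut x2) = (1u64, 0u64);
    while u != 1 && v != 1 {
        // Strip factors of two from u, halving x1 modulo p alongside it.
        while u % 2 == 0 {
            u /= 2;
            x1 = if x1 % 2 == 0 { x1 / 2 } else { (x1 + p) / 2 };
        }
        // Same for v and x2.
        while v % 2 == 0 {
            v /= 2;
            x2 = if x2 % 2 == 0 { x2 / 2 } else { (x2 + p) / 2 };
        }
        // Subtract the smaller from the larger, mirroring the step on x1/x2 mod p.
        if u >= v {
            u -= v;
            x1 = if x1 >= x2 { x1 - x2 } else { x1 + p - x2 };
        } else {
            v -= u;
            x2 = if x2 >= x1 { x2 - x1 } else { x2 + p - x1 };
        }
    }
    if u == 1 { x1 } else { x2 }
}
```

For example, binary_gcd_invert(3, 7) returns 5, since 3 · 5 ≡ 1 (mod 7). The multi-limb versions being benchmarked live or die by how cheaply those limb-level halvings, additions, and subtractions compile.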
-
m-relay<kayabanerve:matrix.org> With how optimized my code is, I don't believe there's too much room left on the table, but I'd like to put forward a call to the developer community to poke at it and see if anything sticks out as further optimizable, or as underlying the distinction in performance between Tritonn's code and my own (which only had a comparable performance benefit _after_ I spent several hours optimizing the implementation itself).
-
m-relay<kayabanerve:matrix.org> It isn't too significant, but it's bothering me, and it's one of the most notable remaining performance gains still on the table before re-architecting with assembly and so on (if it is actually on the table). Feel free to privately reach out if preferred.
-
m-relay<kayabanerve:matrix.org> (The choice was to post here or in monero-dev; this is the next-generation Monero room, and some people minded when there was too much contest discussion in monero-dev, hence posting here. Apologies if this isn't the best room.)
-
moneromooo If the inv code seems like it should be faster but isn't, by a fair amount, this might point to some kind of cache effect rather than an algorithmic problem.
-
m-relay<kayabanerve:matrix.org> With the direct port, the only change should've been the U256 type present. One is from a library, while Tritonn forked and vendored a minimal derivative. They should've had the same size/layout, as they were both thin wrappers around uint64_t[4]. My guess is either the black hole of compiler optimization pipelines just loved one and not the other for no legitimate reason, OR there's some distinction in the lower-level addition functions. There actually was exactly such a distinction, yet applying that patch 'only' caused an 8% improvement, not explaining the 40% difference in its entirety.
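-
To illustrate the kind of type being compared: a minimal sketch of a U256 as a thin wrapper over four 64-bit limbs with a carry-propagating addition. The names and methods here are assumptions for illustration, not the actual library or vendored API.

```rust
/// Sketch of a 256-bit unsigned integer as a thin wrapper around four
/// little-endian 64-bit limbs, as both implementations reportedly use.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct U256(pub [u64; 4]);

impl U256 {
    /// Wrapping addition with full carry propagation across the four limbs.
    pub fn wrapping_add(&self, other: &U256) -> U256 {
        let mut out = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            // A 128-bit intermediate keeps the carry between limbs.
            let sum = (self.0[i] as u128) + (other.0[i] as u128) + (carry as u128);
            out[i] = sum as u64;
            carry = (sum >> 64) as u64;
        }
        U256(out)
    }
}
```

How such limb-level additions are written (in particular, whether the carry chain is expressed in a form the compiler can lower to add-with-carry instructions) is exactly the kind of lower-level distinction that can move benchmarks by single-digit percentages, as seen with the 8% patch mentioned above.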