01:12:55 i've always thought one could use the bitcoin blockchain as a base truth, and then create synthetic ring signatures to create a synthetic monero blockchain with a known ground truth, and then apply whatever whatsits and hoozbangs to try and go from synthetic -> bitcoin
12:56:41 Meeting today at 18:00 UTC, No Wallet Left Behind in Matrix room.
14:35:24 tonz0fphun: "whereas the Authors provide some interesting arguments, those arguments are Hypotheses. There exists no evidence that the RingCT obfuscation mechanism is easy or possible to compromise." I don't agree with this in general. If the decoy selection algorithm is very different from user behavior, then guessing the real spend correctly becomes likely.
14:39:45 Moser et al. (2018) showed that: https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=15
14:41:46 Before 2018, Monero had many privacy flaws. The decoy selection algorithm was very different from the real spend age distribution. Moser et al. (2018) just guessed that the most recent output in the ring was the real spend and they achieved 80% correct guesses.
14:43:31 How did they know the guesses were correct? They used another technique, sometimes called "chain reaction", that exploited the optional ring size (and the possibility of ring-size-1 transactions) to eliminate possible outputs using graph analysis. The technique is deterministic: with chain reaction you can say with certainty that a certain ring member was the real spend.
14:44:40 Chain reaction was mostly fixed by enforcing a higher ring size and using RingCT to hide amounts. The "Guess Newest" heuristic is not very effective anymore since the decoy selection algorithm has changed.
14:48:09 tonz0fphun: The problem with using ACK-J's dataset is that it's not real users. A computer controlled the spending habits.
Since a statistical or machine learning attack would usually exploit the difference between the decoy selection algorithm and real user behavior, the value of that dataset is limited. Still useful for certain purposes, but limited.
14:49:35 IMHO, it would be better to use the real Monero mainnet data for a clustering analysis. neptune has some software to extract Monero transaction data from the blockchain to put into a usable format. It requires some setup: https://github.com/neptuneresearch/ring-membership-sql
14:50:14 If you want to use mainnet data, we could figure out a way to get the data to you in a usable format.
14:50:28 Thanks for looking into the issue!
14:52:46 There are many other papers about this. I just don't want to flood you with them without your permission :)
15:00:45 @Rucknium feel free to send them to me in a PM. So basically you're saying that a simulation would not provide a synthetic dataset which is realistic enough? Does Neptune provide annotations? I'll have a look later tonight and see what I can get out of it.
15:01:52 What do you mean by annotations?
15:02:16 I will just post the papers here with commentary so others can see.
15:03:51 I am doing my own research on this issue, but I'm using traditional frequentist statistics. A machine learning approach is complementary and needed.
15:04:08 Is the mainnet data making associations between an obfuscated output and the real transaction?
15:04:10 That's what I mean by "annotation"
15:04:16 or "label"
15:04:48 No.
15:04:53 if it's not then I can't use it with supervised approaches, only unsupervised, hence why I would consider creating a synthetic dataset using a simulation
15:05:05 Exactly
15:05:06 Ok, fair enough
15:05:21 You would have to have an unsupervised technique
15:26:09 yeah, I would. I can try that, but that implies there's a relationship which is discoverable from some form of featurising, clustering, manifolding, etc.
A labelled approach would always be preferred in this scenario.
15:29:04 Monero's protocol guarantees that a true labeled approach is impossible. That's a good thing for user privacy :)
15:30:14 So there's no way for me to build a synthetic dataset by extracting meta-data, such as a group of transactions, the majority of which are fake, and the actual one that is real (and all related meta)?
15:30:28 The minor exception to that would be service providers that have access to some user traffic. Centralized exchanges would know which of the withdrawals are real spends. Is that useful? It would be a biased sample definitely.
15:30:53 tonz0fphun: I think rucknium is trying to say that we’ve gotten as close as we can testing supervised learning with synthetic datasets and we largely found no evidence of significant metadata leakage. An unsupervised approach seems more interesting to the research community at the moment and could be able to cluster user wallets, spending patterns, etc.
15:31:41 MyMonero has some info about user behavior too. That's a biased sample. They would not give it up even if it is for research to improve the protocol, I'm pretty sure.
15:32:02 xmrack: Yes
15:35:39 Ok, both fair points. IMHO a supervised approach will yield significantly better results. The caveat is that generating such a dataset is time-consuming, complex and possibly a large undertaking. For an unsupervised approach, I'll look into Neptune and whatever other datasets there are, and see how I can featurise them. However, bear in mind that such an approach will not yield the same insights, as it will be lacking information that a supervised approach already contains. If that's something the community is ok with, then the real task is more of an "observational" approach as to what can be inferred passively if that makes sense.
15:41:08 tonz0fphun: I think that would be wonderful
15:42:59 We have meetings every Wednesday at 17:00 UTC here in this room. It's text chat only.
You could maybe get more feedback there. The currently-active MRL researchers working on statistical and machine learning attacks on Monero privacy are mostly me, xmrack, and isthmus.
15:45:39 isthmus has found some cool things in the past using variations of DBSCAN if I remember correctly
15:47:16 tonz0fphun: you are 100% right with the time-consuming part. I spent hundreds if not thousands of hours trying to find the best way to collect the datasets and featurize the results. After all that it was still nowhere close to perfect
15:48:25 I do like gingeropolous's idea to add ring signatures to bitcoin transactions though
15:49:50 What's the advantage of doing that? I'm not criticising, my understanding of blockchain is somewhat limited. Isn't that a synthetic dataset manipulation?
15:53:23 It's synthetic but at least based on real user data on another real blockchain. xmrack's project generated transactions based on some simple rules.
15:54:52 The bitcoin idea is "ok", but I'm not sure it's worth the effort at this point. If we had more labor resources, then yes we could allocate some to that idea.
15:59:44 "I am doing my own research on..." <- I did some ml stuff. Not an expert by any means (just some experience in chess programming), but I have some compute as well (not much, but enough)
16:00:43 If someone wants to transfer me some data, maybe I can check some stuff
16:02:54 (in chess now there's a problem of how to feature-ize data as well. Networks have to be very small for ab engines)
16:04:11 Ok, I'm happy to proceed with unsupervised, but I do like the idea of synthesizing a labelled dataset, because from there I can reverse the features (remove them one at a time) and infer which might be the one (or combination) that leaks information. But I do agree this is a considerably larger undertaking.
If the Bitcoin transactions can be wrapped with ring signatures and then used to create such a dataset, then that would be vastly simpler than setting up an experimental framework
16:05:47 Would be quite interesting to use Bitcoin data
16:07:21 Rucknium: you have studied the spending habbits of btc, ltc, and doge recently. Which do you think is closest to the spending habits of monero, using the Moser deanonymized set as a baseline?
16:07:24 Personally, I wouldn't use BTC but something with more Monero-like characteristics like LTC, BCH, or DOGE. Low fees. Lower tx volume than BTC.
16:07:27 Habits*
16:07:42 Lol great minds think alike
16:08:17 You would have to implement the decoy selection algorithm on those coins. I am almost finished with a math formula of the decoy selection algorithm. Put it on hold for a while.
16:10:08 xmrack: I don't know. Maybe I should try to answer that question. The Moser et al. (2018) data is old. You would want to compare the same time period. BCH didn't exist as a separate chain at that point by the way.
16:14:31 tonz0fphun: Here is that analysis, by the way: https://rucknium.me/html/spent-output-age-btc-bch-ltc-doge.html
16:14:44 Source code: https://github.com/Rucknium/OSPEAD/tree/main/General-Blockchain-Age-of-Spent-Outputs
16:18:46 Rucknium[m]: 👍️ I'll go through it too.
16:20:37 Here are the papers. Don't say I didn't warn you.
16:20:54 Mackenzie, A., Noether, S., & Monero Core Team. (2015). Improving obfuscation in the cryptonote protocol. https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=7
16:22:27 This paper is the first, to my knowledge, to write about the timing issue with decoy selection. It was released less than a year after the Monero blockchain began. See Section 3.1, "Temporal Associations".
16:23:29 Kumar, C., Tople, S., & Saxena, P. (2017). "A traceability analysis of monero’s blockchain."
https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=21
16:23:58 This paper is pretty similar to the Moser et al. (2018) paper. Uses similar techniques.
16:25:16 Ye, C., Ojukwu, C., Hsu, A., & Hu, R. (2020). "Alt-coin traceability." https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=18
16:26:14 This paper analyzes Monero and Zcash. It re-runs the Moser analysis on newer data and finds that the Moser techniques are mostly ineffective against the improved Monero ring signature model with RingCT and a different decoy selection algorithm.
16:28:46 There are several papers that concentrate on chain reaction attacks. I read a few and shared my thoughts here: https://libera.monerologs.net/monero-research-lab/20220706#c117336
16:29:18 Chain reaction attacks probably aren't something that machine learning would be good at.
16:29:40 Anyway, they are probably ineffective at Monero's current ring size.
16:31:04 Here's another chain reaction paper that I didn't discuss in the link above: Vijayakumaran, S. (2021). "Analysis of cryptonote transaction graphs using the Dulmage-Mendelsohn Decomposition." https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=39
16:32:13 Ronge, V., Egger, C., Lai, R. W. F., Schröder, D., & Yin, H. H. F. (2021). "Foundations of ring sampling." https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=19
16:33:26 This paper gives a good formalization of the decoy selection problem. Basically, there are two options. There is "mimicking", which is what Monero tries to do currently. Match the real user behavior as much as possible.
16:34:28 Then there is "partitioning" or using a single "bin". Basically, "eliminate" the timing problem by always selecting ring members from a specific contiguous group of transaction outputs.
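A rough sketch of the two options described at 16:33:26 and 16:34:28. This is not Monero's wallet code and not the construction from the Ronge et al. paper. Outputs are modelled as integer indices in chain order; the function names, the fixed bin size, and the exponential age distribution used for mimicking are all illustrative assumptions.

```python
import random


def partition_ring(real_index, bin_size):
    """Partitioning: the ring is the whole contiguous bin holding the real output.

    Assumes the chain is cut into fixed bins of `bin_size` outputs and that
    the bin containing `real_index` is complete.
    """
    start = (real_index // bin_size) * bin_size
    return list(range(start, start + bin_size))


def mimic_ring(real_index, num_outputs, ring_size, mean_age, rng):
    """Mimicking: draw decoys so their ages resemble real spend ages.

    The exponential age distribution is a placeholder assumption, not the
    distribution Monero uses. Requires num_outputs >= ring_size.
    """
    ring = {real_index}
    while len(ring) < ring_size:
        age = int(rng.expovariate(1.0 / mean_age))  # age in outputs, newest = 0
        candidate = num_outputs - 1 - age
        if 0 <= candidate < num_outputs:
            ring.add(candidate)
    return sorted(ring)


if __name__ == "__main__":
    rng = random.Random(0)
    print(partition_ring(37, 16))
    print(mimic_ring(900, 1000, 16, 50, rng))
```

With partitioning, every real output in the same bin produces the same ring, so the ring says nothing about which member is spent but does reveal the bin. With mimicking, the ring is only as good as the match between the assumed age distribution and real user behavior.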
16:35:36 The main problem with partitioning is that the approximate time that you made your previous transaction would always be linked, even if the observer didn't know exactly which output was being spent. It would also be subject to targeted flooding or "black marble" attacks.
16:37:23 There is a conversation here about partitioning: https://libera.monerologs.net/monero-research-lab/20220829#c142533
16:38:37 Right now Monero uses no binning at all since we think ring size is too small for it to be done without tradeoffs being too steep.
16:40:18 In the draft of the Seraphis code binning is implemented. Basically a hybrid of mimicking and a strict partition. e.g. choose 16 bins of 8 ring members each for a total ring size of 128
16:40:29 Binning, as a hybrid strategy, has not been rigorously analyzed.
16:41:26 IMHO, many papers have been enthusiastic about partitioning because the authors are not statisticians and cannot figure out how to get a good mimicking decoy selection algorithm.
16:43:14 Deuber, D., Ronge, V., & Rueckert, C. (2022). "SoK: Assumptions underlying cryptocurrency deanonymizations". https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=97
16:44:17 Describes some techniques against bitcoin and Monero.
16:45:35 Otávio Chervinski, J., Kreutz, D., & Yu, J. (2021). "Analysis of transaction flooding attacks against Monero." https://moneroresearch.info/index.php?action=resource_RESOURCEVIEW_CORE&id=43
16:46:24 Flooding is a type of active attack against Monero privacy. Probably not too relevant to machine learning since ML is "passive" observation.
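To make the "16 bins of 8 ring members each" example from 16:40:18 concrete, here is a toy version of binning as a hybrid: bins are chosen by a mimicking-style draw, then members are picked uniformly inside each bin. It does not reproduce the Seraphis draft's algorithm. The fixed grid of bins, the bin width, the exponential age distribution, and every name below are assumptions made for illustration.

```python
import random


def binned_ring(real_index, num_outputs, num_bins, members_per_bin,
                bin_width, mean_age, rng):
    """Toy hybrid: pick bins by age (mimicking), then members within each bin."""
    total_bins = num_outputs // bin_width  # only whole grid cells are used
    real_bin = real_index // bin_width
    if real_bin >= total_bins or members_per_bin > bin_width or num_bins > total_bins:
        raise ValueError("parameters do not fit the chain")

    # Step 1: choose which bins appear in the ring. The real output's bin is
    # always included; the others come from a placeholder age distribution.
    chosen_bins = {real_bin}
    while len(chosen_bins) < num_bins:
        age = int(rng.expovariate(1.0 / mean_age))
        candidate = (num_outputs - 1 - age) // bin_width
        if 0 <= candidate < total_bins:
            chosen_bins.add(candidate)

    # Step 2: pick members uniformly inside each chosen bin, making sure the
    # real output is one of the members of its own bin.
    ring = []
    for b in sorted(chosen_bins):
        cell = list(range(b * bin_width, (b + 1) * bin_width))
        if b == real_bin:
            cell.remove(real_index)
            picks = rng.sample(cell, members_per_bin - 1) + [real_index]
        else:
            picks = rng.sample(cell, members_per_bin)
        ring.extend(picks)
    return sorted(ring)


if __name__ == "__main__":
    rng = random.Random(1)
    ring = binned_ring(90_000, 100_000, 16, 8, 64, 5_000, rng)
    print(len(ring))  # 16 bins * 8 members
```

Because the bins sit on a fixed, non-overlapping grid, the ring always has exactly num_bins * members_per_bin distinct members. A real design would have to decide how bins are placed and how the real output's position inside its bin is hidden, which this sketch glosses over.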
16:47:26 Some of us wrote a piece on the first documented flooding incident: https://mitchellpkt.medium.com/fingerprinting-a-flood-forensic-statistical-analysis-of-the-mid-2021-monero-transaction-volume-a19cbf41ce60
16:48:32 Rucknium[m]: It could be active though, really you could use it for anything
16:48:47 Like iterative search
16:49:10 I think I linked you this before: My draft "paper" on how to estimate the real spend age distribution for creating a mimicking decoy selection algorithm: https://github.com/monero-project/research-lab/issues/93
16:49:54 ghostway: Could you explain?
16:49:54 You did, yea
16:51:01 I am done with spamming papers :)
16:55:39 In passive observation, I guess you mean "hey this is some transactions, give me guesses". But this really isn't what you're trying to do. Just a little example of "active" predictions, is to change states. Like puct search (this is with perfect information games, but can be adapted after some thought on the specifics). You have a root node, according to some policy you sample the children, and then iteratively expand their children (with a "value" prediction being a correction to that policy) all by trying to compromise between exploitation (you know this edge is probably better to do, so just visit it more) and exploration (maybe the other moves are ok as well, maybe they have a refutation). This is a very broad description and lacking many details. But gives the idea of what I'm trying to say
16:55:40 You change your own state, while gathering information and exploring deeper, trying to be efficient in sampling
16:56:36 Well, the Monero blockchain isn't a game in the game theoretic sense. Maybe something could be learned.
16:56:47 If you've heard of DeepMind's AlphaZero... It's that..
16:57:48 Is it useful for something that isn't a game?
16:58:34 Monero's blockchain isn't a game. There are players, but no strategies and no payoffs really.
Maybe you could convince me that it can be formed as a game
16:59:24 Of course it's not a game easily. But intuitively (without much thought) it can be made out of this
17:00:36 Also, I was just making an example of puct. Another algorithm is probably needed, but using some graphs is the answer many times
17:01:46 That example being a way of using ml to fill unknowns iteratively whenever we get new data
17:04:41 Maybe isthmus has some thoughts. He has a chess repo on his GitHub: https://github.com/Mitchellpkt/Chess-Tactical-Vision
17:07:25 Lol, sure. I think you're a little too fixated on the chess thing, or you just don't agree with what I said about iteratively gathering clues
17:10:12 Btw, his issue on GitHub is quite easy to fix... Instead of & with the other's pieces when calculating attacks, just don't...
17:12:57 I don't mean to discourage. I'm trying to understand. I focus on traditional statistics instead of machine learning so I'm probably missing something.
17:57:51 time
17:58:49 yeah, using litecoin as a base truth would make sense. with bitcoin, you'd also have to "clean the data" with regards to coinjoins and other obfuscation techniques
18:00:05 i don't know how far down the graph analysis rabbit hole this effort would have to go. because on one extreme, you would need to trace all the activity on the chain to know the ground truth
18:00:19 though maybe just per transaction, knowing whether you can identify the true spend is enough
18:04:14 but at the end of the day, an open-source, free-for-anyone-to-use blockchain tracing tool would be beneficial
18:05:25 I don't know the techniques for that, but if you want, I'd be inclined to learn and write it (when I understand it...). Any resources?
18:13:18 Resources for open source blockchain tracing?
18:13:35 Blockchain tracing, heh
18:20:34 This is the best I'm aware of, but it's not maintained and I think it doesn't build anymore: https://github.com/citp/BlockSci
18:22:46 Oh, I misread the message
18:23:01 I meant resources for techniques / how to develop those
18:23:12 (how they developed them)
18:25:13 BlockSci was developed as a proof of concept for a paper. The link is in the GitHub repo
18:26:00 There are other papers, but....do you want me to post more papers?
18:28:55 Lol, do you want to open a: "rucknium paper tantrum" room?
18:46:04 Here's one paper. The nice thing about research papers is that they usually cite most of the relevant previous papers: https://arxiv.org/abs/2107.05749
18:48:53 Thanks!
18:49:23 I hope I'll be helpful in that space when I have a bit more time
18:53:43 MOAR PAPERS
19:14:43 This is another good recent bitcoin tracing paper: https://arxiv.org/abs/2205.13882