21:29:18 https://github.com/monero-project/research-lab/issues/98#issuecomment-1184910993
21:50:59 cache generation time on HDD makes it a no-go
21:54:35 it's too slow even on an SSD
21:54:39 I had another idea of pulling random data from the blockchain (some random block number + the corresponding tx_pruned data from that block) at the beginning of each RandomX hash, and attaching this data to the final Blake hash. This way CPU and I/O can run in parallel, and an HDD should be fast enough to be done in 10-15 ms
21:55:02 *random block number -> data from the block and tx_pruned tables
21:56:03 I tried this, it's too slow. It's better to select a random block and some random txs separately.
21:56:14 miners will have to run fast NVMe drives or straight up loads of RAM
21:56:44 whatever, just select some random chunk of data from either table
21:56:59 so it could be done in one I/O operation
21:57:02 wouldn't that penalize miners on older hardware though, thus giving the upper hand to large farms/servers/high-end hardware?
21:57:26 miners would have to run a RAM disk or some other db structure than LMDB
21:57:50 miners mine in fast mode, so they have 1-2 ms to read the data on each thread
21:57:56 or LMDB with lots of RAM so that it never has to hit the disk
21:58:43 Actually, a typical gaming PC is 10-20 kh/s, so any NVMe capable of 10-20k IOPS will work in theory
21:58:50 selecting 1 block blob means several random disk accesses with LMDB
21:58:52 with proper buffering of completed hashes
21:59:19 or rather queueing of completed hashes until their db data is available
22:00:11 Wait, what about pool miners? Or even people running multiple miners off a single node/p2pool?
22:00:37 they would need their own copies of the blockchain
22:00:37 Wouldn't the network traffic/latency kill the performance?
22:01:12 Right, but if you have to pull random data for each hash, then you'll have to wait for the node to respond
22:01:26 So you're adding internet latency in there, no?
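The per-hash flow discussed above (derive a pseudo-random blob choice from the RandomX output, then fold the blob into the final Blake hash) can be sketched roughly as follows. This is a hypothetical illustration, not the actual proposal: the function names, the index-derivation rule, and the in-memory blob list are all assumptions.

```python
import hashlib
import struct

def select_blob(randomx_output: bytes, blobs: list) -> bytes:
    # The blob choice must be a deterministic function of the hash input,
    # so that a verifier with the full blockchain picks the same blob.
    # (Hypothetical rule: low 64 bits of a short Blake hash, mod table size.)
    idx = struct.unpack("<Q", hashlib.blake2b(randomx_output, digest_size=8).digest())[0]
    return blobs[idx % len(blobs)]

def pow_hash(randomx_output: bytes, blobs: list) -> bytes:
    # Mix the selected pruned blob into the final Blake hash, as proposed.
    blob = select_blob(randomx_output, blobs)
    return hashlib.blake2b(randomx_output + blob, digest_size=32).digest()
```

Since the blob index depends only on the RandomX output, the disk read can be issued as soon as that output is known and the final Blake step queued until the read completes, which is the asynchronous pipeline discussed below.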
22:01:35 even "slow" NVMe drives can do way above 100k IOPS in 100% read mode: https://www.techpowerup.com/review/kingston-nv1-1-tb/4.html
22:01:53 100% read mode is what will happen during mining, right?
22:02:21 there will be some writes as new blocks are added to the db
22:02:40 99.9% read then :D
22:02:52 probably
22:03:30 read at the beginning of each hash -> calculate the hash normally -> queue the final Blake data until the db read is finished
22:03:56 if I/O is done asynchronously, it shouldn't even affect the hashrate that much
22:04:09 the problem is that miners could skip hashes that hit blockchain data they don't have
22:04:16 of course a db in RAM will be much faster anyway
22:04:37 then we can do the db read at the end
22:04:47 more queueing, still the same hashrate
22:04:49 you'd want to force at least 1 RandomX program before selecting the block, maybe more
22:06:11 what's the size of the `blocks` and `tx_pruned` tables now?
22:06:28 around 10 GB together, give or take
22:06:44 ah, so it should be easy to just load them into RAM in xmrig
22:06:56 no need for fancy I/O
22:07:45 LMDB basically does this, but it works even if you don't have enough RAM
22:08:05 but fragmentation
22:08:34 miners could easily have a dedicated db with just the two tables
22:09:03 probably not even a db, just two files growing
22:09:23 because new blocks are usually added to the end
22:09:52 you'd need to store at least the offsets where each blob starts, some primitive database
22:10:47 but then, pools would just need to send new pruned blobs to miners, less than 300 KB per block, right?
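The "two growing files plus stored offsets" primitive database suggested above might look like this minimal sketch: an append-only data file and an index file of fixed-width offsets, so each lookup costs one seek and one read. The class name and file layout are hypothetical.

```python
import struct

class BlobStore:
    """Append-only blob file + offset index: the 'primitive database'
    of two growing files sketched in the discussion (layout is hypothetical)."""

    def __init__(self, data_path: str, index_path: str):
        self.data = open(data_path, "a+b")
        self.index = open(index_path, "a+b")
        # Rebuild the in-memory offset table from the index file on startup.
        self.index.seek(0)
        raw = self.index.read()
        self.offsets = [struct.unpack_from("<Q", raw, i)[0]
                        for i in range(0, len(raw), 8)]

    def append(self, blob: bytes) -> int:
        """New blocks go to the end of both files; returns the record id."""
        self.data.seek(0, 2)
        off = self.data.tell()
        self.data.write(struct.pack("<I", len(blob)) + blob)
        self.index.write(struct.pack("<Q", off))
        self.data.flush()
        self.index.flush()
        self.offsets.append(off)
        return len(self.offsets) - 1

    def read(self, i: int) -> bytes:
        """One seek + one read per blob: the single-I/O lookup wanted above."""
        self.data.seek(self.offsets[i])
        (n,) = struct.unpack("<I", self.data.read(4))
        return self.data.read(n)
```

A pool-side sync tool would only need `append` for each new pruned blob, while the miner's hashing threads call `read` with the per-hash random index.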
22:10:51 plus the initial download
22:11:15 but it's still a lot of bandwidth for pools
22:11:53 your proposal required 256 MB every 64 blocks, so 4 MB per block, which is more than with this approach
22:12:35 actually, even with the original proposal, pools could just send updates every 2 minutes (pruned blobs)
22:12:45 *every block
22:13:23 the initial download can be done via a torrent that updates every day or week
22:14:39 yes, but miners would need to keep the 10 GB database
22:15:02 that's a botnet killer
22:15:15 yeah, but the point was to cripple pools
22:16:26 the whole idea is that each miner must have the blockchain data. It turns out that it's only 10 GB, and pools can send updates every block, which is not as much bandwidth as initially thought
22:16:27 I wonder who would provide the initial download to new miners
22:16:40 initial download via torrent
22:16:49 the pool can just seed from their servers
22:17:04 with a pruned node, you can just run it and it will sync by itself
22:17:24 or a pruned node, yes
22:17:29 so the pool doesn't even need to provide it
22:17:53 if every miner ran a node, that would be a big win
22:22:41 "pool can just seed from their..." <- wouldn't that just open up the pool to an easy ddos?
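The bandwidth comparison above works out as follows, taking the chat's figures as given (300 KB of pruned blobs per block vs. 256 MB every 64 blocks in the original proposal):

```python
# Back-of-envelope per-miner update bandwidth, using the figures from the chat.
BLOCK_INTERVAL_S = 120            # Monero block time, seconds
PRUNED_PER_BLOCK_KB = 300         # pruned blobs sent per block (this approach)
ORIGINAL_MB_PER_64_BLOCKS = 256   # original proposal's dataset refresh

mb_per_block_old = ORIGINAL_MB_PER_64_BLOCKS / 64       # 4.0 MB per block
mb_per_block_new = PRUNED_PER_BLOCK_KB / 1024           # ~0.3 MB per block
kbit_s_per_miner = PRUNED_PER_BLOCK_KB * 8 / BLOCK_INTERVAL_S  # 20 kbit/s
```

So the steady-state stream is about 20 kbit/s per connected miner; the 10 GB initial download via torrent or a pruned node is the dominant cost.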
22:22:48 miners will eventually end up with a "leech" node that just gets data from real nodes and updates these two tables
22:22:58 fewer resources, less disk space used
22:23:50 And not just the pool, but also every miner who contributes would end up getting caught in it (assuming that their upload bandwidth is much smaller than a typical server's)
22:23:59 so something like xmrig doing the hashing and "monero-db-miner-sync" updating the files in realtime
22:26:43 the most likely setup in the end will be 1 real node per miner (they'll run their own node for reliability) + all PCs in their network will run "leech" nodes to sync
22:27:56 so pools will still be possible, but miners will have to run nodes
22:28:19 your test was with a cold disk cache, so it's not a fair comparison to a RAM disk setup
22:28:37 keep in mind that a RAM disk setup requires the application to manage shuttling cold/hot data in and out of the RAM disk
22:28:41 I'm not even sure it would kill botnets
22:28:42 who runs their node from a RAM disk?
22:28:54 whereas LMDB just needs to access blocks and let the FS cache handle things
22:29:08 the amount of data being sent every 2 minutes is not that big, a botnet can handle it with clever data distribution
22:29:18 my point is, LMDB will always be more efficient than any other solution
22:29:45 except when most of the database file is dead weight
22:29:54 on an active node you can assume most of the interior of the B-tree is cached in RAM
22:30:08 therefore most data seeks will only incur 1 IOP
22:31:03 I measured an average "read amplification" factor of 8 with a standard pruned db file
22:31:22 that sounds like the Linux default readahead
22:31:25 e.g. 2 GB actually read from the disk to read 256 MB of blob data
22:31:33 it always reads 64K for any 4K access
22:32:31 perhaps, but with a defragmented db, the amplification factor went down to ~2
22:33:20 so I'm assuming you read a page from the disk and most of the data in it is from other tables
22:33:41 except that we don't interleave data from separate tables onto a single page
22:34:47 a logical page maybe, but the actual hardware page may be larger
22:34:49 and txn blobs will always be much smaller than 4K anyway
22:35:33 large-sector devices still only use 4 KB per sector instead of 512 B
22:35:57 SSDs nowadays use 8 or 16 KB pages
22:36:00 that's a pretty strange result
22:36:58 I don't think the OS cares, it will access VM-page-sized blocks
22:37:47 SSDs cannot read a smaller unit than a page
22:38:06 and they typically cannot erase a smaller unit than 128 pages
22:38:11 that's not true. they can't *erase* a smaller unit than a page
22:38:38 they erase blocks, which are groups of pages
22:40:04 will have to hunt down a data sheet later
22:41:00 anyway, the read amplification you measured only makes sense if you used a cold cache, so internal B-tree pages needed to be read
22:41:49 the whole measurement is for about 500,000 reads (average blob size ~500 bytes)
22:41:59 and then used a warm cache for the "defragmented" DB
22:42:24 no, the defragmented db was also cold
22:42:53 if you know your SSD page size is 16 KB, you might try rerunning the test with the LMDB page size set to 16K
22:43:39 I guess that would require rebuilding the db file
22:43:55 yes. mdb_dump / mdb_load will do it
22:45:18 of course, that will also defragment the data
22:47:20 it's true that the data is interleaved, since it's written on the fly as blocks or txs arrive. but there is no interleaving within pages. so the fragmentation should only result in random seeks, nothing more
22:47:40 it shouldn't result in excess reads per request
23:06:32 So my OS page size is 4K, but the SSD page size is 16K. That explains the read amplification.
23:07:48 that's kinda bad. does x86 even support a 16K VM page size?
23:09:57 HDDs will still be fine with 4K. I think even SMR drives are still 4K
23:10:41 what model SSD was that?
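The read-amplification figures quoted in this exchange work out as follows, a quick arithmetic check treating the measured numbers (2 GB read for 256 MB of blobs, ~500,000 reads of ~500-byte blobs) as given:

```python
# Read-amplification arithmetic from the measurements quoted above.
blob_data_mb = 256                 # useful blob data requested
disk_read_gb = 2                   # actually read from disk (cold cache, fragmented db)
amp_fragmented = disk_read_gb * 1024 / blob_data_mb   # measured factor of 8

reads = 500_000
avg_blob = 500                     # bytes per blob, on average
useful_mb = reads * avg_blob / (1024 * 1024)          # ~238 MB, consistent with 256 MB

# Each ~500-byte blob read still touches at least one OS page (4 KB),
# and the SSD internally serves whole flash pages (16 KB on this drive),
# which is the 4K-vs-16K mismatch blamed at the end of the discussion.
os_page, ssd_page = 4096, 16384
page_mismatch = ssd_page / os_page                    # factor of 4
```

The defragmented-db factor of ~2 then reflects mostly sequential page reads, where neighboring blobs share pages and the per-page cost is amortized.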