21:29:18 https://github.com/monero-project/research-lab/issues/98#issuecomment-1184910993
21:50:59 cache generation time on HDD makes it a no-go
21:54:35 it's too slow even on an SSD
21:54:39 I had another idea of pulling random data from the blockchain (some random block number + the corresponding tx_pruned data from that block) at the beginning of each RandomX hash, and attaching this data to the final Blake hash. This way CPU and I/O can run in parallel, and an HDD should be fast enough to be done in 10-15 ms
21:55:02 *random block number -> data from the block and tx_pruned tables
21:56:03 I tried this, it's too slow. It's better to select a random block and some random txs separately.
21:56:14 miners will have to run fast NVMe drives or straight up loads of RAM
21:56:44 whatever, just select some random chunk of data from either table
21:56:59 so it could be done in one I/O operation
21:57:02 wouldn't that penalize miners on older hardware though, thus giving the upper hand to large farms/servers/high-end hardware?
21:57:26 miners would have to run a RAM disk or some other db structure than LMDB
21:57:50 miners mine in fast mode, so they have 1-2 ms to read the data on each thread
21:57:56 or LMDB with lots of RAM so that it never has to hit the disk
21:58:43 Actually, a typical gaming PC is 10-20 kh/s, so any NVMe capable of 10-20k IOPS will work in theory
21:58:50 selecting 1 block blob means several random disk accesses with LMDB
21:58:52 with proper buffering of completed hashes
21:59:19 or rather queueing of completed hashes until their db data is available
22:00:11 Wait, what about pool miners? Or even people running multiple miners off a single node/p2pool?
22:00:37 they would need their own copies of the blockchain
22:00:37 Wouldn't the network traffic/latency kill the performance?
22:01:12 Right, but if you have to pull random data for each hash, then you'll have to wait for the node to respond
22:01:26 So you're adding internet latency in there, no?
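The per-hash flow discussed above (derive a pseudo-random blob choice from the RandomX output, then fold the blob into the final Blake hash) can be sketched roughly as follows. This is a hypothetical illustration, not the actual proposal: the function names, the index-derivation rule, and the in-memory blob list are all assumptions.

```python
import hashlib
import struct

def select_blob(randomx_output: bytes, blobs: list) -> bytes:
    # The blob choice must be a deterministic function of the hash input,
    # so that a verifier with the full blockchain picks the same blob.
    # (Hypothetical rule: low 64 bits of a short Blake hash, mod table size.)
    idx = struct.unpack("<Q", hashlib.blake2b(randomx_output, digest_size=8).digest())[0]
    return blobs[idx % len(blobs)]

def pow_hash(randomx_output: bytes, blobs: list) -> bytes:
    # Mix the selected pruned blob into the final Blake hash, as proposed.
    blob = select_blob(randomx_output, blobs)
    return hashlib.blake2b(randomx_output + blob, digest_size=32).digest()
```

Since the blob index depends only on the RandomX output, the disk read can be issued as soon as that output is known and the final Blake step queued until the read completes, which is the asynchronous pipeline discussed below.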
22:01:35 even "slow" NVMe drives can do way above 100k IOPS in 100% read mode: https://www.techpowerup.com/review/kingston-nv1-1-tb/4.html
22:01:53 100% read mode is what will happen during mining, right?
22:02:21 there will be some writes as new blocks are added to the db
22:02:40 99.9% read then :D
22:02:52 probably
22:03:30 read at the beginning of each hash -> calculate the hash normally -> queue the final Blake data until the db read is finished
22:03:56 if I/O is done asynchronously, it shouldn't even affect the hashrate that much
22:04:09 the problem is that miners could skip hashes that hit blockchain data they don't have
22:04:16 of course a db in RAM will be much faster anyway
22:04:37 then we can do the db read at the end
22:04:47 more queueing, still the same hashrate
22:04:49 you'd want to force at least 1 RandomX program before selecting the block, maybe more
22:06:11 what's the size of the `blocks` and `tx_pruned` tables now?
22:06:28 around 10 GB together, give or take
22:06:44 ah, so it should be easy to just load them into RAM in xmrig
22:06:56 no need for fancy I/O
22:07:45 LMDB basically does this, but it works even if you don't have enough RAM
22:08:05 but fragmentation
22:08:34 miners could easily have a dedicated db with just the two tables
22:09:03 probably not even a db, just two files growing
22:09:23 because new blocks are usually added to the end
22:09:52 you'd need to store at least the offsets where each blob starts, some primitive database
22:10:47 but then, pools would just need to send new pruned blobs to miners, less than 300 KB per block, right?
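The "two growing files plus stored offsets" primitive database suggested above might look like this minimal sketch: an append-only data file and an index file of fixed-width offsets, so each lookup costs one seek and one read. The class name and file layout are hypothetical.

```python
import struct

class BlobStore:
    """Append-only blob file + offset index: the 'primitive database'
    of two growing files sketched in the discussion (layout is hypothetical)."""

    def __init__(self, data_path: str, index_path: str):
        self.data = open(data_path, "a+b")
        self.index = open(index_path, "a+b")
        # Rebuild the in-memory offset table from the index file on startup.
        self.index.seek(0)
        raw = self.index.read()
        self.offsets = [struct.unpack_from("<Q", raw, i)[0]
                        for i in range(0, len(raw), 8)]

    def append(self, blob: bytes) -> int:
        """New blocks go to the end of both files; returns the record id."""
        self.data.seek(0, 2)
        off = self.data.tell()
        self.data.write(struct.pack("<I", len(blob)) + blob)
        self.index.write(struct.pack("<Q", off))
        self.data.flush()
        self.index.flush()
        self.offsets.append(off)
        return len(self.offsets) - 1

    def read(self, i: int) -> bytes:
        """One seek + one read per blob: the single-I/O lookup wanted above."""
        self.data.seek(self.offsets[i])
        (n,) = struct.unpack("<I", self.data.read(4))
        return self.data.read(n)
```

A pool-side sync tool would only need `append` for each new pruned blob, while the miner's hashing threads call `read` with the per-hash random index.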
22:10:51 plus the initial download
22:11:15 but it's still a lot of bandwidth for pools
22:11:53 your proposal required 256 MB every 64 blocks, so 4 MB per block, which is more than with this approach
22:12:35 actually, even with the original proposal, pools could just send updates every 2 minutes (pruned blobs)
22:12:45 *every block
22:13:23 the initial download can be done via a torrent that updates every day or week
22:14:39 yes, but miners would need to keep the 10 GB database
22:15:02 that's a botnet killer
22:15:15 yeah, but the point was to cripple pools
22:16:26 the whole idea is that each miner must have the blockchain data. It turns out that it's only 10 GB, and pools can send updates every block, which is not as much bandwidth as initially thought
22:16:27 I wonder who would provide the initial download to new miners
22:16:40 initial download via torrent
22:16:49 the pool can just seed from their servers
22:17:04 with a pruned node, you can just run it and it will sync by itself
22:17:24 or a pruned node, yes
22:17:29 so the pool doesn't even need to provide it
22:17:53 if every miner ran a node, that would be a big win
22:22:41 "pool can just seed from their..." <- wouldn't that just open up the pool to an easy ddos?
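The bandwidth comparison above works out as follows, taking the chat's figures as given (300 KB of pruned blobs per block vs. 256 MB every 64 blocks in the original proposal):

```python
# Back-of-envelope per-miner update bandwidth, using the figures from the chat.
BLOCK_INTERVAL_S = 120            # Monero block time, seconds
PRUNED_PER_BLOCK_KB = 300         # pruned blobs sent per block (this approach)
ORIGINAL_MB_PER_64_BLOCKS = 256   # original proposal's dataset refresh

mb_per_block_old = ORIGINAL_MB_PER_64_BLOCKS / 64       # 4.0 MB per block
mb_per_block_new = PRUNED_PER_BLOCK_KB / 1024           # ~0.3 MB per block
kbit_s_per_miner = PRUNED_PER_BLOCK_KB * 8 / BLOCK_INTERVAL_S  # 20 kbit/s
```

So the steady-state stream is about 20 kbit/s per connected miner; the 10 GB initial download via torrent or a pruned node is the dominant cost.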
22:22:48 miners will eventually end up with a "leech" node that just gets data from real nodes and updates these two tables
22:22:58 fewer resources, less disk space used
22:23:50 And not just the pool, but also every miner who contributes would end up getting caught in it (assuming that their upload bandwidth is much smaller than a typical server's)
22:23:59 so something like xmrig doing the hashing and "monero-db-miner-sync" updating the files in realtime
22:26:43 the most likely setup in the end will be 1 real node per miner (they'll run their own node for reliability) + all PCs in their network will run "leech" nodes to sync
22:27:56 so pools will still be possible, but miners will have to run nodes
22:28:19 your test was with a cold disk cache, so it's not a fair comparison to a RAM disk setup
22:28:37 keep in mind that a RAM disk setup requires the application to manage shuttling cold/hot data in and out of the RAM disk
22:28:41 I'm not even sure it would kill botnets
22:28:42 who runs their node from a RAM disk?
22:28:54 whereas LMDB just needs to access blocks and let the FS cache handle things
22:29:08 the amount of data being sent every 2 minutes is not that big, a botnet can handle it with clever data distribution
22:29:18 my point is, LMDB will always be more efficient than any other solution
22:29:45 except when most of the database file is dead weight
22:29:54 on an active node you can assume most of the interior of the B-tree is cached in RAM
22:30:08 therefore most data seeks will only incur 1 IOP
22:31:03 I measured an average "read amplification" factor of 8 with a standard pruned db file
22:31:22 that sounds like the Linux default readahead
22:31:25 e.g. 2 GB actually read from the disk to read 256 MB of blob data
22:31:33 it always reads 64K for any 4K access
22:32:31 perhaps, but with a defragmented db, the amplification factor went down to ~2
22:33:20 so I'm assuming you read a page from the disk and most of the data in it is from other tables
22:33:41 except that we don't interleave data from separate tables onto a single page
22:34:47 a logical page maybe, but the actual hardware page may be larger
22:34:49 and txn blobs will always be much smaller than 4K anyway
22:35:33 large-sector devices still only use 4 KB per sector instead of 512 B
22:35:57 SSDs nowadays use 8 or 16 KB pages
22:36:00 that's a pretty strange result
22:36:58 I don't think the OS cares, it will access VM-page-sized blocks
22:37:47 SSDs cannot read a smaller unit than a page
22:38:06 and they typically cannot erase a smaller unit than 128 pages
22:38:11 that's not true. they can't *erase* a smaller unit than a page
22:38:38 they erase blocks, which are groups of pages
22:40:04 will have to hunt down a data sheet later
22:41:00 anyway, the read amplification you measured only makes sense if you used a cold cache, so internal B-tree pages needed to be read
22:41:49 the whole measurement is for about 500,000 reads (average blob size ~500 bytes)
22:41:59 and then used a warm cache for the "defragmented" DB
22:42:24 no, the defragmented db was also cold
22:42:53 if you know your SSD page size is 16 KB, you might try rerunning the test with the LMDB page size set to 16K
22:43:39 I guess that would require rebuilding the db file
22:43:55 yes. mdb_dump / mdb_load will do it
22:45:18 of course, that will also defragment the data
22:47:20 it's true that the data is interleaved, since it's written on the fly as blocks or txs arrive. but there is no interleaving within pages. so the fragmentation should only result in random seeks, nothing more
22:47:40 it shouldn't result in excess reads per request
23:06:32 So my OS page size is 4K, but the SSD page size is 16K. That explains the read amplification.
23:07:48 that's kinda bad. does x86 even support a 16K VM page size?
23:09:57 HDDs will still be fine with 4K. I think even SMR drives are still 4K
23:10:41 what model SSD was that?
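The read-amplification figures quoted in this exchange work out as follows, a quick arithmetic check treating the measured numbers (2 GB read for 256 MB of blobs, ~500,000 reads of ~500-byte blobs) as given:

```python
# Read-amplification arithmetic from the measurements quoted above.
blob_data_mb = 256                 # useful blob data requested
disk_read_gb = 2                   # actually read from disk (cold cache, fragmented db)
amp_fragmented = disk_read_gb * 1024 / blob_data_mb   # measured factor of 8

reads = 500_000
avg_blob = 500                     # bytes per blob, on average
useful_mb = reads * avg_blob / (1024 * 1024)          # ~238 MB, consistent with 256 MB

# Each ~500-byte blob read still touches at least one OS page (4 KB),
# and the SSD internally serves whole flash pages (16 KB on this drive),
# which is the 4K-vs-16K mismatch blamed at the end of the discussion.
os_page, ssd_page = 4096, 16384
page_mismatch = ssd_page / os_page                    # factor of 4
```

The defragmented-db factor of ~2 then reflects mostly sequential page reads, where neighboring blobs share pages and the per-page cost is amortized.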