-
sech1
cache generation time on HDD makes it no go
-
tevador
it's too slow even on an SSD
-
sech1
I had another idea: pull random data from the blockchain (some random block number + the corresponding tx_pruned from that block) at the beginning of each RandomX hash and attach this data to the final Blake hash. This way CPU and I/O can run in parallel, and an HDD should be fast enough to be done in 10-15 ms
-
sech1
*random block number -> data from block and tx_pruned tables
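A minimal sketch of that idea in C++ (`read_pruned_blob`, `blake2b_256` and the selector derivation are illustrative placeholders, not real monerod or RandomX interfaces): a blob is picked pseudorandomly from the pruned tables at the start of each hash attempt and committed to in the final Blake step.

```cpp
// Sketch only: read_pruned_blob, blake2b_256 and the selector derivation are
// illustrative placeholders, not real monerod or RandomX APIs.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Placeholder for a real Blake2b-256 implementation.
Bytes blake2b_256(const Bytes&) { return Bytes(32, 0); }

// Placeholder for "read one pruned blob from the blocks / tx_pruned tables".
Bytes read_pruned_blob(uint64_t /*height*/) { return Bytes(500, 0); }

// PoW = Blake2b(randomx_result || blob). The blob is chosen pseudorandomly from
// the hashing blob at the start of the hash, so the db read can overlap with
// the RandomX computation.
Bytes finalize_with_chain_data(const Bytes& hashing_blob,
                               const Bytes& randomx_result,
                               uint64_t chain_height)
{
    uint64_t selector = 0;
    std::memcpy(&selector, blake2b_256(hashing_blob).data(), sizeof(selector));
    Bytes blob = read_pruned_blob(selector % std::max<uint64_t>(chain_height, 1));

    Bytes buf = randomx_result;
    buf.insert(buf.end(), blob.begin(), blob.end());
    return blake2b_256(buf);
}
```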
-
tevador
I tried this, it's too slow. It's better to select a random block and some random txs separately.
-
sech1
miners will have to run fast NVMe drives or straight up loads of RAM
-
sech1
whatever, just select some random chunk of data from either table
-
sech1
so it could be done in one I/O operation
-
merope
wouldn't that penalize miners on older hardware though, thus giving the upper hand to large farms/servers/high-end hardware?
-
tevador
miners would have to run a RAM disk or some db structure other than LMDB
-
sech1
miners mine in fast mode, so they have 1-2ms to read the data on each thread
-
tevador
or LMDB with lots of RAM so that it never has to hit the disk
-
sech1
Actually, a typical gaming PC is 10-20 kH/s, so any NVMe capable of 10-20k IOPS will work in theory
-
tevador
selecting 1 block blob means several random disk accesses with LMDB
-
sech1
with proper buffering of completed hashes
-
sech1
or rather queueing of completed hashes until their db data is available
-
merope
Wait, what about pool miners? Or even people running multiple miners to a single node/p2pool?
-
tevador
they would need their own copies of the blockchain
-
merope
Wouldn't the network traffic/latency kill the performance?
-
merope
Right, but if you have to pull random data for each hash, then you'll have to wait for the node to respond
-
merope
So you're adding internet latency in there, no?
-
sech1
even "slow" NVMe drives can do way above 100k IOPS in 100% read mode:
techpowerup.com/review/kingston-nv1-1-tb/4.html
-
sech1
100% read mode is what will happen during mining, right?
-
tevador
there will be some writes as new blocks are added to the db
-
sech1
99.9% read then :D
-
tevador
probably
-
sech1
read at the beginning of each hash -> calculate the hash normally -> queue the final Blake data until the db read is finished
-
sech1
if I/O is done asynchronously, it shouldn't even affect the hashrate that much
-
tevador
the problem is that miners could skip hashes that hit blockchain data they don't have
-
sech1
of course db in RAM will be much faster anyway
-
sech1
then we can do db read in the end
-
sech1
more queueing, still the same hashrate
-
tevador
you'd want to force at least 1 RandomX program before selecting the block, maybe more
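A sketch of the queueing scheme being discussed (assumed names and trivial stubs, not xmrig code): the blob selector is derived from the RandomX result, the db read runs asynchronously, and the final Blake step waits in a queue until the blob arrives, so I/O overlaps with the next nonce's CPU work.

```cpp
// Sketch only: run_randomx, read_pruned_blob and blake2b_256 are trivial stubs
// standing in for the real implementations.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <deque>
#include <future>
#include <vector>

using Bytes = std::vector<uint8_t>;

Bytes run_randomx(const Bytes&, uint64_t) { return Bytes(32, 1); }   // stub
Bytes read_pruned_blob(uint64_t)          { return Bytes(500, 2); }  // stub
Bytes blake2b_256(const Bytes&)           { return Bytes(32, 3); }   // stub

struct PendingHash {
    uint64_t nonce;
    Bytes randomx_result;      // output of the RandomX program(s)
    std::future<Bytes> blob;   // in-flight db read for the selected blob
};

void mine(const Bytes& job, uint64_t start_nonce, uint64_t count, uint64_t chain_height) {
    std::deque<PendingHash> pending;

    for (uint64_t nonce = start_nonce; nonce < start_nonce + count; ++nonce) {
        // At least one full RandomX program has already run when the blob is
        // selected, so nonces can't be skipped cheaply based on missing data.
        Bytes rx = run_randomx(job, nonce);
        uint64_t selector = rx[0] % std::max<uint64_t>(chain_height, 1);

        PendingHash ph{nonce, std::move(rx),
                       std::async(std::launch::async, read_pruned_blob, selector)};
        pending.push_back(std::move(ph));

        // Finalize every hash whose db read has completed; the CPU keeps
        // hashing while slower reads are still in flight.
        while (!pending.empty() &&
               pending.front().blob.wait_for(std::chrono::seconds(0)) ==
                   std::future_status::ready) {
            PendingHash p = std::move(pending.front());
            pending.pop_front();

            Bytes buf  = p.randomx_result;
            Bytes blob = p.blob.get();
            buf.insert(buf.end(), blob.begin(), blob.end());
            Bytes pow = blake2b_256(buf);   // compare against target, submit, etc.
            (void)pow;
        }
    }
    // (anything still left in `pending` would be drained the same way here)
}
```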
-
sech1
what's the size of the `blocks` and `tx_pruned` tables now?
-
tevador
around 10 GB together, give or take
-
sech1
ah, so it should be easy to just load them into RAM in xmrig
-
sech1
no need for fancy I/O
-
tevador
LMDB basically does this, but it works even if you don't have enough RAM
-
sech1
but fragmentation
-
tevador
miners could easily have a dedicated db with just the two tables
-
sech1
probably not even a db, just two files growing
-
sech1
because new blocks are usually added to the end
-
tevador
you'd need to store at least the offsets where each blob starts, some primitive database
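Something like this would already be enough (a sketch with arbitrary file names): blob bytes go into one append-only file and 8-byte start offsets into another, so blob i can be fetched with one small index read plus one data read.

```cpp
// Sketch of the "two growing files" layout; file names and layout are arbitrary.
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Append one pruned blob: record its start offset, then the bytes themselves.
void append_blob(const std::string& dir, const Bytes& blob) {
    const std::string data_path = dir + "/blobs.dat";
    uint64_t offset = std::filesystem::exists(data_path)
                          ? std::filesystem::file_size(data_path) : 0;

    std::ofstream idx(dir + "/offsets.dat", std::ios::binary | std::ios::app);
    idx.write(reinterpret_cast<const char*>(&offset), sizeof(offset));

    std::ofstream data(data_path, std::ios::binary | std::ios::app);
    data.write(reinterpret_cast<const char*>(blob.data()),
               static_cast<std::streamsize>(blob.size()));
}

// Read blob i: one seek in the index, one seek in the data file.
Bytes read_blob(const std::string& dir, uint64_t i) {
    std::ifstream idx(dir + "/offsets.dat", std::ios::binary);
    std::ifstream data(dir + "/blobs.dat", std::ios::binary);

    uint64_t begin = 0, end = 0;
    idx.seekg(static_cast<std::streamoff>(i * sizeof(uint64_t)));
    idx.read(reinterpret_cast<char*>(&begin), sizeof(begin));
    if (!idx.read(reinterpret_cast<char*>(&end), sizeof(end))) {
        // Last blob: it runs to the end of the data file.
        end = std::filesystem::file_size(dir + "/blobs.dat");
    }

    Bytes blob(end - begin);
    data.seekg(static_cast<std::streamoff>(begin));
    data.read(reinterpret_cast<char*>(blob.data()),
              static_cast<std::streamsize>(blob.size()));
    return blob;
}
```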
-
sech1
but then, pools would just need to send new pruned blobs to miners, less than 300 KB per block, right?
-
sech1
plus the initial download
-
sech1
but it's still a lot of bandwidth for pools
-
sech1
your proposal required 256 MB every 64 blocks, so 4 MB per block, which is more than with this approach
-
sech1
actually, even with the original proposal, pools could just send updates every 2 minutes (pruned blobs)
-
sech1
*every block
-
sech1
initial download can be done via a torrent that updates every day or week
-
tevador
yes, but miners would need to keep the 10 GB database
-
tevador
that's a botnet killer
-
sech1
yeah, but the point was to cripple pools
-
sech1
the whole idea is that each miner must have the blockchain data. It turns out it's only 10 GB, and pools can send updates every block, which is not as much bandwidth as initially thought
-
tevador
I wonder who would provide the initial download to new miners
-
sech1
initial download via torrent
-
sech1
pool can just seed from their servers
-
tevador
with a pruned node, you can just run it and it will sync by itself
-
sech1
or pruned node, yes
-
sech1
so pool doesn't even need to provide it
-
tevador
if every miner ran a node, that would be a big win
-
merope
<sech1> "pool can just seed from their..." <- wouldn't that just open up the pool to an easy ddos?
-
sech1
miners will eventually end up with a "leech" node that just gets data from real nodes and updates these two tables
-
sech1
fewer resources, less disk space used
-
merope
And not just the pool, but also every miner who contributes would end up getting caught in it (assuming that their upload bandwidth is much smaller than a typical server)
-
sech1
so something like xmrig doing the hashing and "monero-db-miner-sync" updating the files in real time
-
sech1
the most likely setup in the end will be 1 real node per miner (they'll run their own node for reliability) + all PCs in their network running "leech" nodes to sync
-
sech1
so pools will still be possible, but miners will have to run nodes
-
hyc
your test was with a cold disk cache, so it's not a fair comparison to a RAMdisk setup
-
hyc
keep in mind that a RAMdisk setup requires the application to manage shuttling cold/hot data in and out of the RAMdisk
-
sech1
I'm not even sure it would kill botnets
-
tevador
who runs their node from a RAM disk?
-
hyc
whereas LMDB just needs to access blocks and let the FS cache handle things
-
sech1
the amount of data being sent every 2 minutes is not that big, a botnet can handle it with clever data distribution
-
hyc
my point is, LMDB will always be more efficient than any other solution
-
tevador
except when most of the database file is dead weight
-
hyc
on an active node you can assume most of the interior of the Btree is cached in RAM
-
hyc
therefore most data seeks will only incur 1 IOP
-
tevador
I measured an average "read amplification" factor of 8 with a standard pruned db file
-
hyc
that sounds like Linux default readahead
-
tevador
e.g. 2 GB actually read from the disk to read 256 MB of blob data
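One way such a number can be obtained on Linux (a sketch; the device name and the benchmark body are assumptions, and the page cache should be dropped beforehand for a cold-cache run): compare the device-level sectors-read counter against the logical bytes the benchmark requested.

```cpp
// Sketch only: replace run_blob_read_benchmark with the real workload and
// "nvme0n1" with the device under test.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// /sys/block/<dev>/stat: field 3 is "sectors read"; these counters use 512-byte sectors.
uint64_t device_bytes_read(const std::string& dev) {
    std::ifstream f("/sys/block/" + dev + "/stat");
    uint64_t reads = 0, merged = 0, sectors = 0;
    f >> reads >> merged >> sectors;
    return sectors * 512;
}

// Placeholder: do the actual workload here (e.g. ~500,000 random blob reads)
// and return the number of logical bytes it asked for.
uint64_t run_blob_read_benchmark() { return 256ull << 20; }

int main() {
    const std::string dev = "nvme0n1";             // adjust to the tested drive
    const uint64_t before  = device_bytes_read(dev);
    const uint64_t logical = run_blob_read_benchmark();
    const uint64_t after   = device_bytes_read(dev);

    std::cout << "read amplification: "
              << double(after - before) / double(logical) << "\n";
}
```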
-
hyc
it always reads 64K for any 4K access
-
tevador
perhaps, but with a defragmented db, the amplification factor went down to ~2
-
tevador
so I'm assuming you read a page from the disk and most of the data in it is from other tables
-
hyc
except that we don't interleave data from separate tables onto a single page
-
tevador
logical page maybe, but the actual hardware page may be larger
-
hyc
and txn blobs will always be much smaller than 4K anyway
-
hyc
large-sector devices still only use 4KB per sector instead of 512B
-
tevador
SSDs nowadays use 8 or 16 KB pages
-
hyc
that's a pretty strange result
-
hyc
I don't think the OS cares, it will access VM-page sized blocks
-
tevador
SSDs cannot read a smaller unit than a page
-
tevador
and they typically cannot erase a smaller unit than 128 pages
-
hyc
that's not true. they can't *erase* a smaller unit than a page
-
tevador
they erase blocks, which are groups of pages
-
hyc
will have to hunt down a data sheet later
-
hyc
anyway, the read amplification you measured only makes sense if you used a cold cache, so internal btree pages needed to be read
-
tevador
the whole measurement is for about 500 000 reads (average blob size ~500 bytes)
-
hyc
and then used a warm cache for the "defragmented" DB
-
tevador
no, the defragmented db was also cold
-
hyc
if you know your SSD page size is 16KB you might try rerunning a test with LMDB pagesize set to 16K
-
tevador
I guess that would require rebuilding the db file
-
hyc
yes. mdb_dump / mdb_load will do it
-
hyc
of course, that will also defragment the data
-
hyc
it's true that the data is interleaved since it's written on the fly as blocks or txs arrive. but there is no interleaving within pages. so the fragmentation should only result in random seeks, nothing more
-
hyc
it shouldn't result in excess reads per request
-
tevador
So my OS page size is 4K, but the SSD page size is 16K. That explains the read amplification.
-
hyc
that's kinda bad. does x86 even support 16K VM pagesize?
-
hyc
HDDs will still be fine with 4K. I think even SMR drives are still 4K
-
hyc
what model SSD was that?