chainstate leveldb is rewritten every 30 or 60 minutes #35298

tuxArg opened this issue on May 15, 2026
  1. tuxArg commented at 8:43 PM on May 15, 2026: none

    Is there an existing issue for this?

    • I have searched the existing issues

    Current behaviour

    I am trying to understand why all .ldb files in chainstate are regenerated every 60 minutes (sometimes every 30 minutes). Something cron-like seems to trigger it. I would really like to avoid this, as it writes 10 GB of data in each 30- or 60-minute period.

    cat debug.log | grep '\[leveldb\] Level-0' | grep started
    2026-05-15T12:18:02Z [leveldb] Level-0 table #1356: started
    2026-05-15T12:18:05Z [leveldb] Level-0 table #75238: started
    2026-05-15T12:18:06Z [leveldb] Level-0 table #68702: started
    2026-05-15T13:18:21Z [leveldb] Level-0 table #75501: started
    2026-05-15T13:48:21Z [leveldb] Level-0 table #75761: started
    2026-05-15T14:18:21Z [leveldb] Level-0 table #76022: started
    2026-05-15T14:48:21Z [leveldb] Level-0 table #76285: started
    2026-05-15T15:18:21Z [leveldb] Level-0 table #76550: started
    2026-05-15T15:48:21Z [leveldb] Level-0 table #76813: started
    2026-05-15T16:48:21Z [leveldb] Level-0 table #77078: started
    2026-05-15T17:48:21Z [leveldb] Level-0 table #77348: started
    2026-05-15T18:48:21Z [leveldb] Level-0 table #77620: started
    2026-05-15T19:48:21Z [leveldb] Level-0 table #77895: started

    After each of these events there are many log entries like:
    2026-05-15T19:48:21Z [leveldb] Delete type=2 #77536
    2026-05-15T19:48:22Z [leveldb] Compacting 1@1 + 0@2 files
    2026-05-15T19:48:22Z [leveldb] Generated table #77896@1: 11927 keys, 558881 bytes
    2026-05-15T19:48:22Z [leveldb] compacted to: files[ 0 0 3 32 487 2 0 ]

    The whole process lasts about 40 seconds and ends with:
    2026-05-15T19:49:03Z [leveldb] Delete type=2 #77898

    Expected behaviour

    I don't expect the chainstate db to be regenerated every hour. Or, better, I would like to be able to configure how often it happens.

    Steps to reproduce

    Relevant bitcoin.conf:

    server=1
    txindex=1
    dbcache=800
    maxmempool=100
    debug=leveldb

    Relevant log output

    No response

    How did you obtain Bitcoin Core

    Pre-built binaries

    What version of Bitcoin Core are you using?

    30.2

    Operating system and version

    ubuntu 24.04

    Machine specifications

    I run it inside a podman container on a shared ext4 filesystem.

  2. pinheadmz commented at 7:47 PM on May 17, 2026: member

    I think this comment in the code answers most of your question?

    https://github.com/bitcoin/bitcoin/blob/7802e578c3f1e9a5d9b57fb003349d0e032bb43b/src/validation.cpp#L93-L98

    I asked claude to help explain the rest of it:

    Why 10 GB gets written. The actual dirty data written per flush is small — just the UTXO changes from the last hour's blocks. The 10 GB write you're seeing is LevelDB's internal compaction, which is triggered by the flush but rewrites far more data.

    With -dbcache=800 and -txindex, your coins cache gets roughly 700 MB. LevelDB is configured with

    write_buffer_size = cache / 4 ≈ 175 MB

    With LevelDB's default L0_CompactionTrigger = 4, four level-0 files (≈700 MB) trigger compaction into level-1, which then cascades: level-1 (10 MB max) → level-2 (100 MB) → level-3 (1 GB) → level-4 (10 GB). The entire chainstate lives at the deeper levels, so a cascade rewrite of ~10 GB is normal.
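The level size targets quoted above follow a simple rule: 10 MiB for level 1, then ten times more for each deeper level. A minimal sketch of that arithmetic, assuming LevelDB's default sizing; the function name is chosen here for illustration and this is not Bitcoin Core code. Level 0 is omitted because it is triggered by file count rather than by byte size.

```cpp
#include <cassert>

// Target size in bytes for a LevelDB level under default sizing:
// 10 MiB for level 1, multiplied by 10 for every deeper level.
// Level 0 is governed by the number of files, so it is not modelled here.
double MaxBytesForLevel(int level) {
    assert(level >= 1);
    double result = 10.0 * 1048576.0;
    while (level > 1) {
        result *= 10.0;
        --level;
    }
    return result;
}
```

Calling `MaxBytesForLevel(4)` gives the roughly 10 GB target mentioned for level 4. A chainstate of 10 GB or more therefore sits mostly in levels 4 and 5.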

  3. tuxArg commented at 12:42 AM on May 18, 2026: none

    @pinheadmz I appreciate that you found exactly the lines that seem to trigger this issue. My question is why we need this.

    Just before db compaction everything works fine, so there's no actual urgency to do this every hour. If it can be done 24 times a day, it could probably be done just once a day too.

    So, why is it fixed at 50 to 70 minutes then? It would be better to have an option to configure it.

  4. l0rinc commented at 7:47 AM on May 18, 2026: contributor

    So, why is it fixed at 50 to 70 minutes then? It would be better to have an option to configure it.

    This was added in #30611, see the motivation in the PR description.

    I am trying to understand why all .ldb files in chainstate are regenerated every 60 minutes

    That's not what happens; the permanent storage is updated instead of keeping everything in memory. Can you please tell us what the problem is that you're trying to solve?

    After each of these events there are many log entries like

    You don't need to enable LevelDB debug logging. This is normal behavior: the state is written to disk (LevelDB), which does some cleanup (background compaction) to keep disk access optimal. Again, this is a feature, not a bug. If we leave things in memory for too long, any interrupt (e.g. a crash) would wipe out that state and you would need to redo the work.

    I don't expect the chainstate db to be regenerated every hour.

    It's not, it's just updated regularly to avoid data loss. We could theoretically bump the 50/70 minute range to 90/110 minutes if users think the current interval is too frequent -- what do you think @andrewtoth?

  5. tuxArg commented at 8:52 AM on May 18, 2026: none

    This was added in #30611, see the motivation in the PR description.

    I've just read it and all its thread.

    That's not what happens; the permanent storage is updated instead of keeping everything in memory. Can you please tell us what the problem is that you're trying to solve?

    I run a Bitcoin node and, like many operators, I run it continuously. What I'm trying to reduce is disk I/O (writes, in this case). Each flush to disk writes around 10 GB of data on a full node. That's 240 GB a day. #30611 focused on reducing spikes and avoiding redone work after power outages. Power loss can happen, but it's not frequent enough to justify optimizing for it at the expense of long-running operation.

    Reducing spikes is a step in the right direction, but what about the total data written to disk? How much data was written per day before? I don't think it was 240 GB, because I would have noticed it.

    You don't need to enable LevelDB debug logging.

    I enabled it to find out what was happening. 10 GB every hour is too much to go unnoticed.

    It's not, it's just updated regularly to avoid data loss. We could theoretically bump the 50/70 minute range to 90/110 minutes if users think the current interval is too frequent -- what do you think @andrewtoth?

    It's not to avoid data loss. It was to avoid redoing work, but we can recover that data, so it's not data loss. I think it would be better to let users configure the interval in bitcoin.conf.

  6. andrewtoth commented at 12:08 PM on May 18, 2026: contributor

    @tuxArg did you recently upgrade from version 28 or earlier to version 30.2? In version 29 and up the leveldb max file size was increased from 2MB to 32MB, so compactions will take a long time for the first while until all 2mb files have been compacted to 32mb.

    However, that doesn't explain the frequency of compactions. We now write to the chainstate leveldb every ~hour, but just writing to the db does not trigger a compaction every time.

  7. JohnTravolski commented at 12:44 PM on May 18, 2026: none

    Hi, I am also concerned about this. I monitor cumulative disk writes using smartmontools since I'm running on an SSD. I don't want to kill it early since SSDs have limited writes. Previously my node was writing 20 GB / day (Core 27.0 + Fulcrum indexer), but after I upgraded to Core 31.0 it was writing about 220 GB / day, 11 times more. I used sudo iotop -oPa and let it sit for a few hours to ensure it was attributable to the bitcoin-qt process.

  8. l0rinc commented at 1:15 PM on May 18, 2026: contributor

    @andrewtoth's right, it's probably the file size changes after #30039. After it's done compacting, you will have fewer writes than before. This shouldn't kill an SSD - it took me almost 2 years to kill mine, often doing several full reindex-chainstates per day :)

  9. GURGPqxVwj commented at 1:55 PM on May 18, 2026: none

    I would like to add another point here.

    I read the comments above. I understand that Bitcoin Core intentionally writes the chainstate to persistent storage roughly every 50–70 minutes, and that LevelDB compaction can rewrite much more data than the actual coins cache flush.

    My concern is mainly the total SSD write volume. I still do not understand how a fully synced node in normal operation can write hundreds of GB per day to the SSD. I am not talking about IBD or reindex here.

    I have seen the same general pattern on two different hardware platforms:

    • first on a low-power mini PC with 4 GB RAM
    • now on a newer thin client with 16 GB RAM

    I have also seen the same general pattern with different Bitcoin Core versions:

    • Bitcoin Core 30.2
    • Bitcoin Core 31.0, after upgrading only the binaries and keeping the same datadir

    The current data below is from the newer system, because that setup is cleaner and easier to trust.

    Current setup:

    • Ubuntu Server 24.04.4
    • Bitcoin Core 31.0
    • same datadir previously used with 30.2
    • external NVMe SSD, ext4, noatime
    • not running in a container
    • no txindex
    • dbcache=2048
    • wallet disabled
    • debug=bench
    • debug=leveldb
    • node fully synced, not IBD
    • no kernel I/O errors
    • no EXT4 errors
    • SMART health PASSED, media errors 0
    • Linux block write counter and NVMe SMART Data Units Written match almost exactly

    I mention both the Linux block write counter and the NVMe SMART counter because I wanted to make sure I am not just misreading one tool or measuring some local artifact. During the compaction waves, both counters increased by essentially the same amount, so the writes seem to be real device writes.

    There was a large compaction wave directly after starting 31.0. That seems reasonable to me. The node had to catch up about 60 blocks and LevelDB recovery/startup activity was happening at the same time.

    The more interesting part happened later, after startup/catch-up was finished.

    What I repeatedly observe is this pattern:

    1. For some time, the node behaves as I would expect. New blocks arrive, UpdateTip is logged, and the reported cache grows.

    2. Then a point is reached where the reported UpdateTip cache stops growing. This is what I mean by "plateau" here.

    3. The exact cache value is not always the same. In this 31.0 run the plateau was around 44.0 MiB. In earlier observations I saw the same general pattern begin at other reported cache values. So I do not think 44 MiB itself is a fixed threshold.

    4. From that point on, new blocks are still connected, but the reported cache stays pinned.

    5. Then LevelDB compaction waves start.

    6. After a compaction wave, the reported cache does not continue growing again. It stays pinned, and later more compaction waves can follow.

    Here is the 31.0 observation.

    Shortly before the plateau, there was a normal FlushStateToDisk / BatchWrite:

    • BatchWrite: write coins cache to disk (330525 out of 337279 cached coins)
    • WriteBatch memory usage: db=chainstate, before=0.0MiB, after=26.3MiB

    After that, new blocks were connected normally. But the reported UpdateTip cache reached about 44.0 MiB and stopped growing.

    In this run I observed at least 17 consecutive UpdateTips with the reported cache at about 44.0 MiB:

    • height 949927: cache=44.0MiB
    • height 949928: cache=44.0MiB
    • height 949929: cache=44.0MiB
    • ...
    • height 949938: cache=44.0MiB
    • later also up to at least height 949943: cache=44.0MiB

    The txo count changed during that time, so the node was not idle. New blocks were connected, but the reported cache stayed pinned.

    After this cache plateau, a large LevelDB compaction wave happened.

    From my report, counted since the marker:

    • Compactions: 35
    • Generated tables: 389
    • Deleted tables: 419
    • Generated bytes: 10.903 GiB
    • Compacted bytes: 10.903 GiB
    • chainstate files since marker: 348 files, size-sum about 10.565 GiB

    The SSD write counters matched this closely:

    • Linux written since marker: 11.022 GiB
    • SMART written since marker: 11.022 GiB
    • largest measured interval: 9.933 GiB in 15.2 minutes
    • SMART-Linux difference: 0.000 GiB

    There were no storage errors in dmesg.

    The compaction wave was mostly the familiar pattern of many generated ~34.5 MB .ldb files. The later part contained several lines like:

    • Compacting 1@4 + 10@5 files
    • Generated table ... about 34.5 MB
    • Compacted ... about 345 MB

    What looks important to me is that the cache did not start growing again after the large compaction wave. It stayed around 44.0 MiB for further UpdateTips, and more LevelDB activity happened later.

    So to me the pattern does not look like "one cache flush, then one cleanup, then normal cache growth again". It looks more like the node reaches a state where the reported cache stops growing, and while it stays in that state, compaction waves repeat.

    So what I am trying to understand is this:

    Is this amount of write amplification expected for a fully synced node in normal operation?

    If the answer is "yes, this is expected", then I would like to understand why a synced node can write this much data to the SSD, and whether there is a way to tune this behaviour.

    I can provide selected debug.log snippets and the small write-counter reports if that would help.

  10. andrewtoth commented at 2:14 PM on May 18, 2026: contributor

    @GURGPqxVwj

    Since you have an already synced chainstate, was this synced on a pre-v29 node? If so, that explains the large compactions and the problem will resolve itself after some time. The hundreds of GB being written per day is due to the large compactions from 2mb -> 32mb leveldb files.

    Regarding the cache value in the UpdateTip - this value can decrease as well after some blocks and is normal. The cache increases when a block creates more outputs than it spends. If most transactions have the same number of inputs and outputs, the value will not increase. Incoming mempool transactions will also increase the cache value. Also, every ~hour the cache is written to disk, and during this process all spent entries from the cache are removed. This reduces the number of entries in the cache.

  11. tuxArg commented at 2:14 PM on May 18, 2026: none

    @tuxArg did you recently upgrade from version 28 or earlier to version 30.2? In version 29 and up the leveldb max file size was increased from 2MB to 32MB, so compactions will take a long time for the first while until all 2mb files have been compacted to 32mb.

    I did: 27.1 -> 29 -> 30 -> 30.2. I also ran reindex-chainstate before reporting this, as I thought the upgrade could be the cause, but the behavior is the same. I mostly see 33 MB .ldb files:

    $ ls -l chainstate/*.ldb | awk '{printf "%.0fM\n", $5/1048576}' | sort -n | uniq -c
         71 0M
          1 1M
          5 2M
          1 5M
          1 7M
          2 8M
          1 9M
          3 14M
          1 16M
          1 18M
          2 22M
          2 23M
          2 25M
          2 26M
          2 27M
          2 31M
        552 33M
    

    However, that doesn't explain the frequency of compactions. We now write to the chainstate leveldb every ~hour, but just writing to the db does not trigger a compaction every time.

    Well, that is the bug I'm reporting here. If 10 GB must be written then we need a way to configure how often, but if they are not meant to be rewritten every hour, then we have a bug here that triggers full compaction every time.

  12. andrewtoth commented at 2:23 PM on May 18, 2026: contributor

    @tuxArg thanks, I see. I did measure disk usage in #30611 (comment) and did not see this issue of frequent compactions.

    I will investigate and see if any nodes I run are also experiencing this behavior.

  13. GURGPqxVwj commented at 3:18 PM on May 18, 2026: none

    @andrewtoth Thanks, that explanation helps.

    To answer your question: no, this chainstate was not synced on a pre-v29 node. In my setup the chainstate was created with Bitcoin Core 30.2. I never used v28 or older for this datadir. Later I upgraded only the Bitcoin Core binaries to 31.0 and kept the same datadir.

    So I do not think the 2 MB -> 32 MB migration from a pre-v29 chainstate explains my case, unless I misunderstand something.

    Your explanation about the UpdateTip cache makes sense. I understand now that this value does not have to grow monotonically and can decrease or stay flat depending on the blocks and cache flushes.

    However, after looking at the current debug.log for longer, the large compactions do not seem to be just a one-time catch-up or startup effect. After the node was running normally on 31.0, the reported cache stayed at 44.0 MiB from height 949927 to at least height 949954, while new blocks were connected.

    During this same period I see repeated large LevelDB compaction waves:

    • 11:20:42Z–11:29:11Z: about 10.23 GiB compacted/generated
    • 12:08:09Z–12:29:13Z: about 10.55 GiB
    • 13:06:07Z–13:23:47Z: about 10.55 GiB
    • 14:16:59Z–14:30:32Z: about 10.23 GiB

    Since my marker at 09:02:59Z, the debug log shows 137 compactions and about 42.6 GiB generated/compacted LevelDB output.

    So my remaining question is mainly about the amount and recurrence of this write amplification. Is that still expected for a chainstate created with 30.2, or would you expect those large compactions to settle down after some runtime?

  14. l0rinc commented at 3:34 PM on May 18, 2026: contributor

    Thanks for the detailed reports!

    would you expect those large compactions to settle down after some runtime

    I would also expect them to decrease, but it's normal to have some spikes temporarily. Please let us know if this continues for the following days.

    I will investigate and see if any nodes I run are also experiencing this behavior.

    Thanks.

  15. iotamega commented at 4:22 PM on May 18, 2026: none

    It's not to avoid data loss. It was to avoid redoing work, but we can recover that data, so it's not data loss. I think it would be better to let users configure the interval in bitcoin.conf.

    +1 to allow this to be a configurable option. I am seeing similar issues across many nodes. Having the ability to configure this would be helpful.

  16. ArmchairCryptologist commented at 5:14 PM on May 18, 2026: none

    I'm seeing similar behavior. Two full nodes with the chainstate on an SSD and the blocks on an HDD, running 30.2 and 31.0 respectively, have written on average 13.6 GB/hour and 13.8 GB/hour to their SSDs since last reboot ~12 days ago. Both of them mostly have ~32MB ldb files in the chainstate directory, so the aforementioned compaction seems to have completed, but the write rate is still the same.

    Bitcoin Core version v31.0.0 (release build)
    Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
    sda              41.47      3203.81      4042.79      5447.53 3300410450 4164682942 5611784428
    
    Bitcoin Core version v30.2.0 (release build)
    Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
    sdb              41.05      3155.59      3979.64         0.00 3258175332 4109016343          0
    

    Double-checked the disk stats in the hypervisor for the latter, and the write activity is still ongoing, averaging 5.4 MB/s in the last hour.

    At ~330 GB/day it could take less than a year to reach the rated write endurance for some recent QLC SSDs - for example, the WD Green SN3000 500GB is rated for only 100 TB lifetime writes, which it would reach after only ~300 days at this rate - so this probably needs to be addressed.
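The rates and the endurance estimate in the comment above can be reproduced with a little arithmetic. This is a minimal sketch, assuming that iostat's `kB_wrtn/s` column counts units of 1024 bytes and that the drive's rated endurance is given in decimal terabytes; both function names are chosen here for illustration.

```cpp
#include <cassert>

// Convert an iostat kB_wrtn/s figure (assumed to be KiB per second)
// into GiB written per hour.
double KibPerSecToGibPerHour(double kib_per_sec) {
    assert(kib_per_sec >= 0.0);
    return kib_per_sec * 3600.0 / (1024.0 * 1024.0);
}

// Days until a drive's rated write endurance is used up at a steady
// daily write volume. Inputs are decimal units: TB rated, GB per day.
double DaysToRatedEndurance(double rated_tb, double gb_per_day) {
    assert(gb_per_day > 0.0);
    return rated_tb * 1000.0 / gb_per_day;
}
```

With the reported 4042.79 and 3979.64 kB/s, the first function gives a little under 13.9 and 13.7 GiB per hour, in line with the figures quoted. `DaysToRatedEndurance(100.0, 330.0)` comes out at slightly over 300 days.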

  17. sipa commented at 6:15 PM on May 18, 2026: member

    To the people reporting this issue, what are the ages of your DATADIR/chainstate/*.ldb files? Specifically, what percentage (in terms of byte size) are up to a few hours old? LevelDB database files are immutable, so if ~all files are very recent, that would mean that it is indeed rewriting the whole database on every flush.

    Just to be clear, writing something every 50-70 minutes is expected, but after initial sync (and possibly post-29.0 conversion to 32 MiB files), it shouldn't be writing gigabytes every time. If it's actually rewriting the whole thing all the time, then that is a bug.

    Also, can you share your bitcoin.conf or other notable configuration options? It's clearly not happening to everyone, so there must be something in your configuration that triggers it.
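One way to answer the file-age question above is to sum the sizes of the .ldb files and compare against the bytes held in files modified within a given window. The sketch below assumes a C++17 standard library and uses each file's modification time as a stand-in for its creation time, which is reasonable only because LevelDB table files are immutable. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <filesystem>
#include <vector>

struct FileInfo {
    std::uintmax_t size_bytes;
    std::chrono::seconds age;  // time since last modification
};

// Fraction (0..1) of all bytes that live in files no older than max_age.
double RecentByteFraction(const std::vector<FileInfo>& files,
                          std::chrono::seconds max_age) {
    assert(max_age.count() >= 0);
    std::uintmax_t total = 0;
    std::uintmax_t recent = 0;
    for (const FileInfo& f : files) {
        total += f.size_bytes;
        if (f.age <= max_age) recent += f.size_bytes;
    }
    if (total == 0) return 0.0;
    return static_cast<double>(recent) / static_cast<double>(total);
}

// Collect size and age for every regular *.ldb file directly inside dir.
std::vector<FileInfo> ScanLdbFiles(const std::filesystem::path& dir) {
    namespace fs = std::filesystem;
    std::vector<FileInfo> out;
    const auto now = fs::file_time_type::clock::now();
    for (const fs::directory_entry& e : fs::directory_iterator(dir)) {
        if (!e.is_regular_file() || e.path().extension() != ".ldb") continue;
        const auto age = std::chrono::duration_cast<std::chrono::seconds>(
            now - e.last_write_time());
        out.push_back({e.file_size(), age});
    }
    return out;
}
```

For example, `RecentByteFraction(ScanLdbFiles("chainstate"), std::chrono::hours(2))` returns the share of chainstate bytes written in the last two hours. A value close to 1 would indicate that nearly the whole database was rewritten recently.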

  18. andrewtoth commented at 6:24 PM on May 18, 2026: contributor

    From what I can gather, I think restarting once with -forcecompactdb=1 should help here. It seems there are a lot of scattered files at different levels from [leveldb] compacted to: files[ 0 0 3 32 487 2 0 ], and that could be impacting something here.

    Also, when we bumped max_file_size, we should probably also have bumped write_buffer_size. The latter is implicitly the l0 file size, so every hourly write is also creating a new l0 file. That will make compaction happen more frequently (although it shouldn't cause the whole db to be rewritten each time).

  19. tuxArg commented at 6:27 PM on May 18, 2026: none

    @sipa In my case, the chainstate dir holds 18 GB and 10 GB of that is from the last hour. @andrewtoth I will restart with forcecompactdb=1 and tell you in a few hours whether it keeps doing it.

  20. ArmchairCryptologist commented at 6:54 PM on May 18, 2026: none

    @sipa In bytes, >99% of the .ldb files have been touched in the last two hours on both nodes.

    Outside of binds/addnodes/etc, these are probably the relevant settings:

    disablewallet=1
    dbcache=1000
    maxmempool=1000
    persistmempool=0
    txindex=1
    server=1
    

    Will test starting with forcecompactdb.

  21. tuxArg commented at 7:32 PM on May 18, 2026: none

    From what I can gather, I think restarting once with -forcecompactdb=1 should help here. It seems there are a lot of scattered files at different levels from [leveldb] compacted to: files[ 0 0 3 32 487 2 0 ], and that could be impacting something here.

    I've restarted with forcecompactdb=1. It regenerated all files in the chainstate dir, and all the small files disappeared. But after one hour it compacted everything again, and 10 GB were written to disk again according to iostat.

    $ ls -l chainstate/*.ldb | awk '{printf "%.0fM\n", $5/1048576}' | sort -n | uniq -c
         16 0M
          2 11M
          2 17M
          1 21M
          1 22M
          2 25M
          2 30M
          2 32M
        648 33M

    I'm running it with debug=leveldb, in case that's useful for debugging.

  22. sipa commented at 7:54 PM on May 18, 2026: member

    I note both @ArmchairCryptologist and @tuxArg have -txindex enabled. I wonder if that is related; I'm enabling it on my test system too.

  23. iotamega commented at 8:03 PM on May 18, 2026: none

    I note both @ArmchairCryptologist and @tuxArg have -txindex enabled. I wonder if that is related; I'm enabling it on my test system too.

    I have it enabled on my end as well.

    txindex=1
    coinstatsindex=1
    v2transport=1
    listen=1
    port=8333
    listenonion=0
    shrinkdebugfile=0
    debug=1
    logips=1
    loglevelalways=1
    logtimemicros=1
    printpriority=1
    #capturemessages=1

  24. andrewtoth commented at 8:45 PM on May 18, 2026: contributor

    I was seeing this write amplification on my nodes as well. I instrumented leveldb with some logging to find the issue. It seems due to seek compactions. Due to the large database with random keys, there are many levels that each read will go through and trigger a seek compaction on the way to finding the entry. Seek compactions are not really useful for our workload, so it's possible to just disable it. Size compactions will still occur, so the db will still remain balanced.

    Disabling this would also resolve #29662.

    The following patch fixes the issue for me:

    diff --git a/src/leveldb/db/version_set.cc b/src/leveldb/db/version_set.cc
    index cd07346ea8..35a533a3d1 100644
    --- a/src/leveldb/db/version_set.cc
    +++ b/src/leveldb/db/version_set.cc
    @@ -7,6 +7,7 @@
     #include <stdio.h>
     
     #include <algorithm>
    +#include <limits>
     
     #include "db/filename.h"
     #include "db/log_reader.h"
    @@ -648,21 +649,8 @@ class VersionSet::Builder {
           FileMetaData* f = new FileMetaData(edit->new_files_[i].second);
           f->refs = 1;
     
    -      // We arrange to automatically compact this file after
    -      // a certain number of seeks.  Let's assume:
    -      //   (1) One seek costs 10ms
    -      //   (2) Writing or reading 1MB costs 10ms (100MB/s)
    -      //   (3) A compaction of 1MB does 25MB of IO:
    -      //         1MB read from this level
    -      //         10-12MB read from next level (boundaries may be misaligned)
    -      //         10-12MB written to next level
    -      // This implies that 25 seeks cost the same as the compaction
    -      // of 1MB of data.  I.e., one seek costs approximately the
    -      // same as the compaction of 40KB of data.  We are a little
    -      // conservative and allow approximately one seek for every 16KB
    -      // of data before triggering a compaction.
    -      f->allowed_seeks = static_cast<int>((f->file_size / 16384U));
    -      if (f->allowed_seeks < 100) f->allowed_seeks = 100;
    +      // Disable seek compaction for our workload
    +      f->allowed_seeks = std::numeric_limits<int>::max();
     
           levels_[level].deleted_files.erase(f->number);
           levels_[level].added_files->insert(f);
    
  25. ArmchairCryptologist commented at 8:46 PM on May 18, 2026: none

    I think txindex might be a red herring, but it does seem to amplify it somewhat. I finished checking my other nodes, all of which are pruning nodes without txindex enabled, running either 30.2 or 31.0, and they all have iostat reporting between 7.5 GB/hour and 8.3 GB/hour written since the last system restart (12 days ago for all of them). That is notably less than the txindex nodes, but still substantial. On these nodes, too, all .ldb files in chainstate have been touched in the last 3-4 hours.

    These all run these settings, some have enabled wallet and some do not:

    dbcache=1000
    maxmempool=1000
    persistmempool=0
    

    I can't say for sure how old the chainstate database is on most of these nodes since I usually just sync new nodes from an existing one instead of doing IBD, but at least one of the full nodes did IBD no later than May 2020 based on the file timestamps on the blocks database, so it might be related to that.

    PS: I can also confirm that doing forcecompactdb=1 did not resolve it; while it did eliminate some leftover small ldb files on startup, the chainstate database was fully rewritten again a couple of hours after startup.

  26. GURGPqxVwj commented at 9:02 PM on May 18, 2026: none

    @andrewtoth @sipa

    Thanks, this is very helpful.

    This sounds consistent with what I am seeing. In my case, txindex is not enabled and the chainstate was not created on a pre-v29 node, but I still see the recurring large rewrites.

    I also checked the chainstate .ldb file ages by bytes:

    • files total: 397
    • bytes total: 10.565 GiB
    • <= 1h: 10.231 GiB (96.84%)
    • <= 2h: 10.554 GiB (99.89%)

    So almost the whole current chainstate .ldb set was very recently rewritten.

    I will leave the node running unchanged overnight and report the write volume and file age distribution again tomorrow.

  27. andrewtoth commented at 10:58 PM on May 18, 2026: contributor

    @sipa Each periodic sync every ~hour will produce an l0 file a little over 2 MiB, because max_write_buffer is 2MiB and anything over gets written to l0. Now this file gets allowed_seeks = max(file_size / 16 KiB, 100) ~= 128 seeks before it gets seek compacted. 128 random reads going through this file will happen almost immediately from mempool or next block, and then it gets compacted without waiting for the 4 l0 files that trigger size compaction. This will produce a smaller l1 file than size compaction will, so it also has a smaller seek budget. Random reads will drain that seek budget quickly, and it will get scheduled for compaction again.

    The seek compaction mechanism was designed for spinning disk reads. Not sure we need it. Another option is to increase max_write_buffer to ~32 MiB so it doesn't produce an l0 file every sync, and when it does the l0 file at least has a higher seek budget.
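The seek budget described above can be checked directly. This is a minimal sketch of the formula from the lines removed in the patch earlier in this thread (one seek per 16 KiB, with a floor of 100). The standalone function name is an illustration and is not part of LevelDB's API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Seek budget given to a newly added table file: one allowed seek per
// 16 KiB of file size, but never fewer than 100.
int AllowedSeeks(std::uint64_t file_size) {
    // Guard the narrowing cast; real table files are far below this limit.
    assert(file_size / 16384U <=
           static_cast<std::uint64_t>(std::numeric_limits<int>::max()));
    int allowed = static_cast<int>(file_size / 16384U);
    if (allowed < 100) allowed = 100;
    return allowed;
}
```

A 2 MiB level-0 file gets 128 seeks, a 32 MiB file gets 2048, and the 558881-byte table from the log excerpt at the top of the thread falls back to the floor of 100.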

  28. tuxArg commented at 11:41 PM on May 18, 2026: none

    @andrewtoth What about just removing this line:

    if (f->allowed_seeks < 100) f->allowed_seeks = 100;
    

    Isn't it enough?

  29. tuxArg commented at 12:22 AM on May 19, 2026: none
     //   (1) One seek costs 10ms
     //   (2) Writing or reading 1MB costs 10ms (100MB/s)
     //   (3) A compaction of 1MB does 25MB of IO:
     //         1MB read from this level
     //         10-12MB read from next level (boundaries may be misaligned)
     //         10-12MB written to next level
     // This implies that 25 seeks cost the same as the compaction
     // of 1MB of data.

    The economics around this are not OK. This assumes that the variable is how long a task lasts, rather than what resources it uses. A modern CPU has multiple cores but may have only one or two disks. Disk time costs much more than single core/thread time. Even if we use SSDs and have more I/O bandwidth than 100MB/s, a single seek likely costs much less than 10 ms

  30. andrewtoth commented at 12:36 AM on May 19, 2026: contributor

    @andrewtoth What about just removing this line:

    if (f->allowed_seeks < 100) f->allowed_seeks = 100;
    

    Isn't it enough?

    Removing just that line will cause small files to have an even smaller seek budget, so it would actually make this problem worse. We want all files to have a large seek budget.
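To make the difference between the options discussed above concrete, the sketch below compares three policies side by side: the original formula, the same formula with the floor of 100 removed, and the proposed patch that disables seek compaction. The enum and function are hypothetical names used only for this comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

enum class SeekPolicy {
    kOriginal,  // file_size / 16 KiB, floor of 100
    kNoFloor,   // file_size / 16 KiB, floor removed
    kDisabled,  // patch from this thread: effectively unlimited
};

// Seek budget a new table file would receive under each policy.
int SeekBudget(std::uint64_t file_size, SeekPolicy policy) {
    if (policy == SeekPolicy::kDisabled) {
        return std::numeric_limits<int>::max();
    }
    const std::uint64_t by_size = file_size / 16384U;
    // Guard the narrowing cast; real table files are far below this limit.
    assert(by_size <=
           static_cast<std::uint64_t>(std::numeric_limits<int>::max()));
    const int budget = static_cast<int>(by_size);
    if (policy == SeekPolicy::kOriginal && budget < 100) return 100;
    return budget;
}
```

For a 558881-byte table, the original policy yields 100 seeks and the no-floor variant only 34, so small files would be scheduled for seek compaction sooner. That is the opposite of the intended effect.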


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-05-19 06:51 UTC