p2p: don't disconnect manual peers for block stalling #34743

pull willcl-ark wants to merge 5 commits into bitcoin:master from willcl-ark:protect-manual-evictions changing 5 files +155 −11
  1. willcl-ark commented at 1:28 PM on March 5, 2026: member

    Ref: #5097

    Manual peers added via -addnode, -connect, or the addnode RPC represent explicit operator intent to keep those connections around.

    During IBD, a manual peer can currently be disconnected if it triggers block-stalling logic. This can be surprising in -connect or -addnode-based setups, where the operator may prefer (and probably expect) to keep the peer connected even if it is not a useful block download peer at that moment.

    This PR changes only the block-stalling path. Instead of disconnecting a stalling manual peer, it releases that peer's in-flight block requests so other peers can request them and IBD can continue.

    A one-round recovery flag avoids immediately assigning the freed blocks back to the same manual peer due to per-peer SendMessages ordering. The manual peer can request blocks again after that recovery round.

    This intentionally does not change block download timeout or headers sync timeout behavior.

  2. willcl-ark renamed this:
    Protect manual evictions
    p2p: protect manual evictions
    on Mar 5, 2026
  3. DrahtBot commented at 1:29 PM on March 5, 2026: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK w0xlt, sedited, kevkevinpal, brunoerg

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #35315 (refactor: Use NodeClock::time_point in more places by maflcko)
    • #33854 (fix assumevalid is ignored during reindex by Eunovo)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

  4. DrahtBot added the label P2P on Mar 5, 2026
  5. w0xlt commented at 3:30 PM on March 5, 2026: contributor

    Concept ACK

  6. in test/functional/test_framework/test_node.py:921 in 547f119668
     916 | +
     917 | +        def addnode_callback(address, port):
     918 | +            if v2:
     919 | +                self.addnode('%s:%d' % (address, port), "onetry", v2transport=True)
     920 | +            else:
     921 | +                self.addnode('%s:%d' % (address, port), "onetry")
    


    kevkevinpal commented at 7:59 PM on March 5, 2026:

    Might be able to do this instead. It seems redundant to have the two calls.

    self.addnode('%s:%d' % (address, port), "onetry", v2transport=v2)
    

    willcl-ark commented at 10:08 AM on March 9, 2026:

    Taken in be002a7

  7. jonatack commented at 2:51 AM on March 6, 2026: member

    Thanks for picking this up, Will.

  8. sedited commented at 7:35 AM on March 6, 2026: contributor

    Concept ACK

  9. kevkevinpal commented at 3:45 PM on March 6, 2026: contributor

    Concept ACK

  10. willcl-ark force-pushed on Mar 6, 2026
  11. brunoerg commented at 2:20 PM on March 9, 2026: contributor

    Concept ACK

  12. DrahtBot added the label Needs rebase on Mar 11, 2026
  13. willcl-ark force-pushed on Mar 11, 2026
  14. DrahtBot removed the label Needs rebase on Mar 11, 2026
  15. sedited requested review from ajtowns on Mar 20, 2026
  16. in src/net_processing.cpp:6111 in 749f425094 outdated
    6113 | +                // Return early so the block download logic below doesn't
    6114 | +                // immediately re-assign them to this peer.
    6115 | +                while (!state.vBlocksInFlight.empty()) {
    6116 | +                    RemoveBlockRequest(state.vBlocksInFlight.front().pindex->GetBlockHash(), node.GetId());
    6117 | +                }
    6118 | +                state.m_stall_recovery = true;
    


    ajtowns commented at 1:36 AM on March 23, 2026:

    Would it be better for this to be a timer -- ie, if it stalls, stop requesting any blocks from this peer for X seconds? Having the peer immediately start providing blocks again seems like it sets things up for regular stall/clear/stall/clear behaviour. I could see an argument for setting X to be fairly large, eg 10 minutes, to minimise that behaviour during IBD, while still recovering gracefully after IBD is complete.

  17. in src/net_processing.cpp:6135 in accb5e4edf
    6130 | @@ -6115,27 +6131,39 @@ bool PeerManagerImpl::SendMessages(CNode& node)
    6131 |              QueuedBlock &queuedBlock = state.vBlocksInFlight.front();
    6132 |              int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
    6133 |              if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
    6134 | -                LogInfo("Timeout downloading block %s, %s", queuedBlock.pindex->GetBlockHash().ToString(), node.DisconnectMsg());
    6135 | -                node.fDisconnect = true;
    6136 | -                return true;
    6137 | +                if (node.IsManualConn()) {
    6138 | +                    LogDebug(BCLog::NET, "Not disconnecting manual peer for block download timeout, %s", node.DisconnectMsg());
    


    ajtowns commented at 1:45 AM on March 23, 2026:

    I'm not sure this change makes sense -- if a peer can't deliver a block to us in ten minutes, it's not very useful even if it was a manual connection?

  18. in src/net_processing.cpp:6159 in accb5e4edf
    6160 | +                    // or manual peers; just reset their sync state.
    6161 |                      // Note: If all our peers are inbound, then we won't
    6162 |                      // disconnect our sync peer for stalling; we have bigger
    6163 |                      // problems if we can't get any outbound peers.
    6164 | -                    if (!node.HasPermission(NetPermissionFlags::NoBan)) {
    6165 | +                    if (!node.HasPermission(NetPermissionFlags::NoBan) && !node.IsManualConn()) {
    


    ajtowns commented at 1:50 AM on March 23, 2026:

    Likewise here -- if a peer can't reply to our getheaders request in 15 minutes it doesn't seem very useful; and the "NoBan" permission is already there for keeping connections to not very useful nodes.

  19. ajtowns commented at 1:54 AM on March 23, 2026: contributor

    The PR description is quite complicated, could it be simplified?

    I believe the approach we currently have is "if stalling in block downloads is detected, unstall by disconnecting the peer with the oldest queued block", and this introduces a new unstalling technique "reset the peer with the oldest queued block's entire download queue, but let it restart downloading blocks almost immediately".

  20. ajtowns commented at 2:05 AM on March 23, 2026: contributor

    It might be good to run a test simulating this behaviour -- I think having three nodes: A doing IBD, manually connected to B and C; B normal; C has it's bandwidth shaped by the OS to, say, 10kB/s (trickle -u 10 bitcoind ...) ? With master, that should A disconnect C for stalling; with this PR, I think it should see C repeatedly stall, and as a result see a somewhat slower IBD than if C were disconnected.

  21. willcl-ark force-pushed on Mar 24, 2026
  22. frankomosh commented at 7:49 AM on April 1, 2026: contributor

    Looked into the test coverage for the new m_stall_recovery mechanism and have some findings. I ran mutation testing to check how well the functional test covers the m_stall_recovery mechanism. I targeted lines 6100–6127 (the IsManualConn() stalling branch) and lines 6177–6188 (the m_stall_recovery skip logic).

    Setup:

    mutation-core mutate -f="src/net_processing.cpp" --range 6100 6127   # 8 mutants
    mutation-core mutate -f="src/net_processing.cpp" --range 6177 6188   # 9 mutants
    mutation-core analyze -f="muts-net_processing-cpp" \
      -c="cmake --build build -j$(nproc) && build/test/functional/p2p_ibd_stalling.py"
    

    Result: 52.94% mutation score (9 killed / 17 total)

    Four surviving mutants are directly in the PR's new code and seems to suggest the test doesn't exercise m_stall_recovery at all:

    # Survivor 1: disabling the flag entirely,  test doesn't notice
    -                state.m_stall_recovery = true;
    +                state.m_stall_recovery = false;
     
    # Survivor 2: not returning early after releasing blocks , test doesn't notice
                    state.m_stall_recovery = true;
    -                return true;
    +                return false;
     
    # Survivor 3: skip logic completely removed,  test doesn't notice
    -        if (state.m_stall_recovery) {
    +        if (1==0) {
     
    # Survivor 4: flag never resets, permanently skipping downloads, test doesn't notice
    -            state.m_stall_recovery = false;
    +            state.m_stall_recovery = true;
    

    The current test (test_manual_peer_stalling) verifies two things well:

    1. The manual peer is not disconnected (killed mutant 3.r1: fDisconnect = true → false)
    2. IBD completes through other peers

    But it does not verify that the m_stall_recovery one-round skip actually fires, or has any effect. All four mutations to the skip mechanism survive because the outbound peers download blocks fast enough that the skip is irrelevant to the test outcome.

    This is not blocking in any way, the core protection (no disconnect) is well-tested. But if m_stall_recovery is meant to be a meaningful part of the design (preventing immediate re-assignment of freed blocks), it may be worth adding a test assertion that verifies the manual peer doesn't get new block requests in the immediate round after release?

    Tested on Ubuntu 24.04


  23. p2p: don't disconnect manual peers for block stalling
    Manual connections (-addnode, -connect, addnode RPC) represent explicit
    operator intent to maintain the connection. During IBD, current block
    stalling logic can disconnect them when the download window cannot
    advance.
    
    Instead, release the stalled manual peers' in-flight block requests so
    other peers can pick them up and IBD can continue. Skip one block
    download round for that peer to avoid immediately assigning the freed
    blocks back to it.
    
    Leave block download timeout and headers sync timeout behavior
    unchanged.
    254e2d2937
  24. test: add manual p2p connection helper
    Add TestNode.add_manual_p2p_connection() for functional tests that need
    connections created through the addnode RPC.
    
    This mirrors add_outbound_p2p_connection(), while using addnode "onetry"
    so the node treats the peer as a manual connection.
    b4d2539e12
  25. willcl-ark renamed this:
    p2p: protect manual evictions
    p2p: don't disconnect manual peers for block stalling
    on May 1, 2026
  26. willcl-ark force-pushed on May 1, 2026
  27. willcl-ark commented at 9:08 AM on May 1, 2026: member

    Thanks for the reviews. Sorry for the slow reply, I was on annual leave.

    I’ve rebased and reduced the scope of the PR a bit in respone to the above. Since the previous version:

    • I now only change the IBD block-stalling path for manual peers
      • Drop protection for manual peers from block download timeout and headers sync timeout disconnects, this feels like a broader policy change which could be done later...
    • Keep the one-round recovery skip to avoid immediately assigning the released blocks back to the same manual peer due to per-peer SendMessages ordering.
      • Added test coverage for the recovery skip per suggestion #34743#pullrequestreview-4042358709
    • Updated comments, commit messages, RPC help, and release notes
  28. willcl-ark commented at 9:13 AM on May 1, 2026: member

    This is not blocking in any way, the core protection (no disconnect) is well-tested. But if m_stall_recovery is meant to be a meaningful part of the design (preventing immediate re-assignment of freed blocks), it may be worth adding a test assertion that verifies the manual peer doesn't get new block requests in the immediate round after release?

    Thanks for this. coverage was missing indeed. I added this in a standalone commit at the end for now, but it can be squashed.

    The PR description is quite complicated, could it be simplified?

    I hope I've'done this now, also I renamed it with the new narrow scope of the change.

    Would it be better for this to be a timer -- ie, if it stalls, stop requesting any blocks from this peer for X seconds? @ajtowns I do quite like this idea, but I think "add a manual-peer block download cooldown policy" seems like a broader change than what I am proposing here, which is mainly to prevent the disconnects.

  29. test: verify manual peers survive block stalling
    Check that a manual peer which stalls a requested block is not
    disconnected when block-stalling detection fires.
    
    The test also verifies that IBD can continue after the stalled manual
    peer's in-flight block request is released for other peers to serve.
    8fac9354e5
  30. doc: document manual peer block-stalling behavior
    Update addnode RPC help and release notes so manual peer documentation
    matches the new IBD block-stalling behavior.
    
    This makes the user-facing description explicit that manual peers are
    protected from DoS disconnection and block-stalling disconnection, while
    preserving the existing caveat that unsuitable peers are not synced
    from.
    402a6c77a6
  31. test: cover manual peer stall recovery scheduling
    Extend the manual peer stalling test to verify the one-round recovery skip
    that prevents immediately reassigning released blocks back to the same
    manual peer.
    
    Also verify that the manual peer can request blocks again after the recovery
    round, so the skip does not persist beyond the intended scheduling turn.
    8a8c30b3e1
  32. willcl-ark force-pushed on May 1, 2026
  33. DrahtBot added the label CI failed on May 1, 2026
  34. DrahtBot removed the label CI failed on May 1, 2026
  35. in test/functional/test_framework/test_node.py:903 in b4d2539e12
     899 | @@ -900,6 +900,33 @@ def addconnection_callback(address, port):
     900 |  
     901 |          return p2p_conn
     902 |  
     903 | +    def add_manual_p2p_connection(self, p2p_conn, *, p2p_idx, **kwargs):
    


    mzumsande commented at 2:16 PM on May 1, 2026:

    did you consider using add_outbound_p2p_connection(connection_type="manual") instead instead of having a new helper?

  36. in src/net_processing.cpp:6114 in 254e2d2937
    6116 | -            // bandwidth is insufficient.
    6117 | -            const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
    6118 | -            if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
    6119 | -                LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
    6120 | +            if (node.IsManualConn()) {
    6121 | +                LogDebug(BCLog::NET, "Not disconnecting manual peer for stalling block download, %s", node.DisconnectMsg());
    


    mzumsande commented at 2:37 PM on May 1, 2026:

    Doesn't make sense to use DisconnectMsg() when we aren't disconnecting.

  37. mzumsande commented at 3:43 PM on May 1, 2026: contributor

    Some thoughts:

    Peers are assigned 16 blocks in parallel (MAX_BLOCKS_IN_TRANSIT_PER_PEER). Whenever any peer gets a free slot, but there is no possible assignment from the 1024-block window (because all other blocks are already present or in flight), the stalling process is started. After 2s, a manual staller will be disconnected and reconnect (master), or its block are cleared (PR).

    I think what we definitely want to avoid is that the stalling peer will get re-assigned any of the block requests it was assigned before - otherwise it seems likely that it will stall right again. If it gets assigned other blocks ~1000 blocks later, it has some time to recover from what may be just a temporary phase of slowness.

    That means that in the meantime (2s + time until the manual staller gets assigned block requests again), 15 more slots must have opened up somewhere, usually by receiving 15 more blocks from other peers. The current approach may reduce this time, because the disconnection/reconnection process happening on master for manual connections will also take up a couple of seconds, whereas just skipping one SendMessages round may be quicker, so it might lead to more stalling situations than before. @aj's suggestion of a timeout would avoid that.

    The above is just me thinking how it should work, it would be interesting to verify empirically with the current branch if reassignments actually happen or the above is wrong. Maybe doing IBD on mainnet with 8 fast clearnet peers + one slower manual onion peer would be a possible setup to measure this on mainnet?


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-05-20 06:52 UTC