scripted-diff: structural risk in the eval-based verifier #35275

furszy commented at 7:13 PM on May 12, 2026: member

Scripted-diff makes large mechanical refactors reviewable. The author includes a small script, the "recipe", between -BEGIN VERIFY SCRIPT- and -END VERIFY SCRIPT- markers in the commit message. Contributors then review the recipe instead of reviewing the mechanical diff line by line. Then, commit-script-check.sh runs the recipe and checks that the result matches the committed diff. The premise is that reviewing a few lines of shell is less error-prone than reviewing a long mechanical diff.

Today, that script recipe is passed to eval as-is, with no sanitization or sandboxing.

This means the security of the mechanism depends solely on reviewers decoding the exact program that will later run on their machines, which is a weaker assumption than it looks.

There are two separate problems:

The script reviewers may not see the same program the shell executes.
Even when the program is visible and looks innocent, the recipe is still arbitrary shell that can be doing more than what is expected. The diff check only confirms reproducibility, not intent.

For the first case, a way to bypass the reviewer's eyes is to embed commands in Unicode characters that are not displayed in standard browsers, terminals, and editors. Zero-width characters such as U+200B, U+200C, U+200D, U+FEFF, and others are actual characters in the byte stream that are not rendered. Pairs or sequences of them can encode data inside what appears to reviewers as ordinary whitespace.
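To make the gap concrete, here is a small illustrative sketch (not part of the verifier; the helper name `reveal_invisible` and the sample recipe line are made up for this example). It shows two recipe lines that render the same in most tools but differ in length, and a helper that makes the zero-width characters named above visible:

```python
# Illustrative only: make zero-width characters visible in a recipe line.
# The set below contains just the code points named in the text above.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def reveal_invisible(text: str) -> str:
    """Replace each zero-width character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in ZERO_WIDTH else ch for ch in text
    )


clean = "sed -i s/foo/bar/ src/a.cpp"
# Same visible text, with two zero-width characters after the sed expression.
tainted = "sed -i s/foo/bar/\u200b\u200c src/a.cpp"

print(len(clean), len(tainted))   # the lengths differ by two
print(reveal_invisible(tainted))  # the hidden characters show up as tags
```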

That gives an attacker a way to hide commands inside a scripted-diff block. The visible recipe can look benign, while the shell receives something else. The result is arbitrary code execution on any system that runs commit-script-check.sh against that commit.

This is the class of issues described by Boucher and Anderson in the Trojan Source paper: https://arxiv.org/abs/2111.00169. The paper focuses on source code, but the same review gap applies here because the verifier executes a raw script from the commit message.

The immediate patch one might think of is to limit the script block to printable ASCII, which closes the Unicode-based vectors. But while necessary, that is not enough on its own.
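A minimal sketch of what such a filter could look like, assuming "printable ASCII" means the range 0x20 to 0x7E with newline as the only allowed line separator. The function name is hypothetical and this is not the check from any existing branch:

```python
def non_printable_ascii(script: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for every character outside 0x20..0x7E.

    Newlines only separate lines and are not reported; tabs are reported.
    """
    findings = []
    for line_no, line in enumerate(script.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not 0x20 <= ord(ch) <= 0x7E:
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings


# A recipe line with a zero-width space hidden after "sed".
print(non_printable_ascii("sed\u200b -i s/a/b/ f.cpp"))
```

A verifier could refuse to run any block for which this list is non-empty.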

Why a printable ASCII-only fix is not sufficient

The structural problem is that eval runs whatever bytes are inside the markers, no matter how those bytes got there. There are at least two paths the ASCII filter doesn't touch.

1) ASCII has its own invisible characters

A pure-ASCII payload can be encoded in whitespace patterns. Tabs and spaces are different bytes but often look identical in editors, review tools and terminals. Trailing whitespace is also frequently invisible. A small decoder can reconstruct a payload from those patterns.
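As a rough illustration, a lint along these lines could flag the whitespace patterns mentioned above. The function name and the exact rules (any tab, any trailing space or tab) are assumptions made for this sketch:

```python
def suspicious_whitespace(script: str) -> list[tuple[int, str]]:
    """Flag tabs and trailing whitespace, which pass a pure-ASCII filter."""
    findings = []
    for line_no, line in enumerate(script.split("\n"), start=1):
        if "\t" in line:
            findings.append((line_no, "tab"))
        if line != line.rstrip(" \t"):
            findings.append((line_no, "trailing whitespace"))
    return findings


# Pure ASCII, but the first line carries three trailing spaces.
recipe = "sed -i s/a/b/ f.cpp   \nsed -i s/c/d/ g.cpp\n"
print(suspicious_whitespace(recipe))
```

This only narrows the channel. It does not help with a payload that lives outside the script block.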

2) The payload doesn't have to be in the commit message at all

This is the XZ utils pattern (CVE-2024-3094). The malicious bytes lived in binary test fixtures, and the build machinery extracted them at build time. Reviewers were watching the build scripts, not the test fixtures, because test fixtures don't look like code.

The same pattern applies here. The recipe can stay completely clean while the payload lives in another file, added by the same PR or a different one, anywhere reviewers don't read carefully line by line. The recipe doesn't need to contain anything suspicious.

The main point here is that with eval, the limit is on the attacker's creativity.

Why the diff check isn't sufficient

The verifier checks that the recipe reproduces the committed diff. That is a reproducibility check, not a safety check.

The recipe and the diff are submitted by the same author. The check confirms they're consistent with each other, not that either is safe. The recipe only needs to look benign; the diff itself or what the script executes does not have to.

Even worse, reviewers rarely read the full diff. The whole premise of scripted-diff is that they focus on the recipe instead, which is exactly the dangerous reading mode for attacks that split behavior across different inputs.

What scripted-diffs actually do for us

Before suggesting a way forward, here is what scripted-diffs are actually used for. I wrote an analyzer that walks every scripted-diff commit in the repository history and classifies the transformations based on the tools used and the outcome.

Out of 394 scripted-diff commits, about 95% fall into a small set of operations:

73% full-identifier or regex-anchored renames
14% function-call argument restructuring
8% literal text substitutions
5% line or inline deletions
5% copyright bumps
4% file moves or renames
4% include-path updates
1% namespace prefix or member renames

Note: The percentages add to more than 100% because some commits contain more than one transformation.

The commands themselves are not that many. Across all 394 commits, this is what we have:

sed: 709 invocations across 335 commits
xargs: 74 invocations across 61 commits
git grep: 53 invocations across 38 commits
git mv: 37 invocations across 18 commits
./contrib/devtools/copyright_header.py: 19 invocations across 13 commits
grep: 15 invocations across 14 commits
perl: 11 invocations across 10 commits
git ls-files: 7 invocations across 7 commits

Everything else appears only a handful of times: find, git rm, git apply, git diff, git show, git archive, tar, mkdir, clang-format-diff, and bash -c.

So the current mechanism gives a lot of freedom, but the actual use cases are much narrower: mostly text rewrites and file moves.

Some real examples

Some existing scripted-diff recipes are dense enough that reviewing them requires shell expertise. For example, 0184d33b from PR #31072 and 9d1dbbd from PR #29404.

Those are merged, reviewed commits, and I am not suggesting there is anything wrong with them. They are useful examples because they show the review burden we already accept today.

The question is whether the verifier needs to give recipes full shell power for this class of change.

How other projects handle similar workflows

The Linux kernel uses Coccinelle for refactors and API migrations. The tool has no primitive for running arbitrary commands.

LLVM and Chromium use clang-rename for similar refactors. These operate on the AST through a fixed set of typed transformations, not by running shell. The user invokes a specific rename rule or a clang-tidy check, with no way to run arbitrary code.

The common pattern is a fixed set of operations: none of these projects let a contributor's refactor commit run arbitrary code on a reviewer's or CI machine.

Investigated Paths

1) Adopting clang-rename

This covers C++ identifier renames cleanly (~74% of historical scripted-diffs), but the remaining 26% (file moves, non-C++ rewrites, and so on) needs separate handling. So clang-rename is useful, but it is not a full replacement for scripted-diff.

2) Sandboxing

We could keep the eval call but restrict what the attacker can do: scope it to the repository directory, with no network and no IPC. Linux has several tools for this: bubblewrap, firejail, nsjail, landlock. This would limit data exfiltration, writes outside the repo, and other related attacks. But alone it doesn't solve the overall review problem. Sandboxing restricts what the recipe does, not which recipe runs. So sandboxing plus rejecting anything outside printable ASCII is a reasonable defense, but it adds platform-specific machinery that has to work on every developer's setup, which may not be ideal.
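For a sense of the machinery involved, here is a hedged sketch that only builds a bubblewrap command line and does not run it. The helper name and the choice of options are assumptions for illustration, not a vetted sandbox profile:

```python
def bwrap_argv(repo_dir: str, recipe: str) -> list[str]:
    """Build (but do not run) a bubblewrap command line for a recipe."""
    return [
        "bwrap",
        "--ro-bind", "/", "/",         # host filesystem visible, but read-only
        "--bind", repo_dir, repo_dir,  # only the repository is writable
        "--dev", "/dev",
        "--proc", "/proc",
        "--unshare-net",               # no network access
        "--unshare-ipc",               # separate IPC namespace
        "--die-with-parent",
        "--chdir", repo_dir,
        "bash", "-c", recipe,          # the recipe is still arbitrary shell
    ]


print(" ".join(bwrap_argv("/path/to/repo", "sed -i s/a/b/ f.cpp")))
```

Note that with a read-only bind of `/` the recipe can still read files outside the repository; a stricter profile would have to bind only what the tools need, which is where the per-platform work comes in.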

3) Restrict shell

The idea here is to keep eval, but constrain what recipes can run: a list of permitted commands, ASCII filtering, no control flow.

The issue is that many of the commands recipes use are themselves arbitrary-code-execution tools. perl -e runs arbitrary Perl, find -exec runs whatever you point it at, GNU sed's e flag runs shell, and xargs can re-launch shell via xargs sh -c '...'. So, to make this actually work, we would have to filter at the flag level and trace indirect invocation through tools like these.

In essence, if we go down this route, we would be creating a complex typed-grammar approach, just with shell underneath.
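To illustrate why a name-level filter falls short, here is a sketch of a deliberately naive allowlist. The allowlist contents and function names are assumptions for this example. It tokenizes a recipe line with Python's `shlex` and only looks at the first word of each command:

```python
import shlex

# Hypothetical allowlist, loosely based on the most common historical commands.
ALLOWED = {"sed", "git", "xargs", "perl", "grep", "find"}
SEPARATORS = {"|", "&&", "||", ";"}


def command_names(recipe_line: str) -> list[str]:
    """Return the first word of every command in a pipeline or command list."""
    lexer = shlex.shlex(recipe_line, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    names, expect_command = [], True
    for token in lexer:
        if token in SEPARATORS:
            expect_command = True
        elif expect_command:
            names.append(token)
            expect_command = False
    return names


def naive_allowlist_ok(recipe_line: str) -> bool:
    return all(name in ALLOWED for name in command_names(recipe_line))


# Rejected: bash is not on the list.
print(naive_allowlist_ok("bash -c 'echo hi'"))
# Accepted, although xargs launches a shell: "sh" is never a command name here.
print(naive_allowlist_ok("git ls-files | xargs sh -c 'echo hi'"))
```

Closing that hole means understanding the arguments of every allowed tool, which is the flag-level filtering described above.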

4) Replace shell with typed primitives

We can drop shell entirely and replace it with a small set of operations with named arguments. Instead of treating the recipe as code, treat it as data. The verifier reads the recipe, validates each operation against a schema, and dispatches to our well-reviewed python implementation.

Two primitives cover most of what historical scripted-diffs do:

RENAME mode old new files: text substitution in a list of files. mode is one of word (matches whole identifiers only, like a portable \b...\b), literal (exact string match), or regex (python's regex).
RENAME_FILE src dst: file move via git mv.
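A minimal sketch of how the three RENAME modes could behave on a single string. This is an illustration, not the code from the WIP branch; in particular, treating "whole identifier" as "not adjacent to `[A-Za-z0-9_]`" is an assumption:

```python
import re


def rename_text(mode: str, old: str, new: str, text: str) -> str:
    """Apply one RENAME operation to a string and return the result."""
    if mode == "literal":
        return text.replace(old, new)
    if mode == "word":
        # Match `old` only when it is not part of a longer identifier.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(old) + r"(?![A-Za-z0-9_])"
        # A function replacement keeps `new` literal (no backslash expansion).
        return re.sub(pattern, lambda _match: new, text)
    if mode == "regex":
        return re.sub(old, new, text)
    raise ValueError(f"unknown RENAME mode: {mode!r}")


line = "if (r == DB_LOAD_OK) x = NOT_DB_LOAD_OK;"
print(rename_text("word", "DB_LOAD_OK", "DBErrors::LOAD_OK", line))
```

In `word` mode the second occurrence is left alone because it is part of the longer identifier `NOT_DB_LOAD_OK`.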

Example

-BEGIN VERIFY SCRIPT-
RENAME literal "enum DBErrors" "enum class DBErrors" src/wallet/walletdb.h
RENAME word DB_LOAD_OK DBErrors::LOAD_OK src/**/*.cpp src/**/*.h
RENAME word DB_CORRUPT DBErrors::CORRUPT src/**/*.cpp src/**/*.h
-END VERIFY SCRIPT-

A regex-mode rewrite, for example simplifying HexStr(buf.begin(), buf.end()) to HexStr(buf):

-BEGIN VERIFY SCRIPT-
RENAME regex 'HexStr\(([^(]+)\.begin\(\), *([^(]+)\.end\(\)\)' 'HexStr(\1)' src/**/*.cpp src/**/*.h
-END VERIFY SCRIPT-
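To show what "recipe as data" could mean in practice, here is a hedged sketch of a parser that turns recipe lines into typed operations and rejects everything else. The class names and validation rules are assumptions for illustration; `shlex.split` is used only to handle quoting, and nothing is executed:

```python
import shlex
from dataclasses import dataclass

MODES = {"word", "literal", "regex"}


@dataclass(frozen=True)
class Rename:
    mode: str
    old: str
    new: str
    files: tuple[str, ...]  # file patterns, kept as plain strings


@dataclass(frozen=True)
class RenameFile:
    src: str
    dst: str


def parse_recipe(recipe: str) -> list:
    """Parse recipe lines into operations; raise ValueError on anything else."""
    ops = []
    for line_no, line in enumerate(recipe.splitlines(), start=1):
        if not line.strip():
            continue
        op, *args = shlex.split(line)
        if op == "RENAME" and len(args) >= 4 and args[0] in MODES:
            ops.append(Rename(args[0], args[1], args[2], tuple(args[3:])))
        elif op == "RENAME_FILE" and len(args) == 2:
            ops.append(RenameFile(args[0], args[1]))
        else:
            raise ValueError(f"line {line_no}: unsupported operation: {line!r}")
    return ops


recipe = '''RENAME literal "enum DBErrors" "enum class DBErrors" src/wallet/walletdb.h
RENAME word DB_LOAD_OK DBErrors::LOAD_OK src/**/*.cpp src/**/*.h
'''
for operation in parse_recipe(recipe):
    print(operation)
```

A line such as `sed -i s/a/b/ f.cpp` raises `ValueError` here instead of reaching a shell.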

What this would mean for developers

Developers can still iterate locally the same way they do today, with the new functions. We can provide a terminal tool cli-scripted-diff.py that can be used in the following way:

cli-scripted-diff.py RENAME word some_name some_other_name <files>

The verifier and the CLI share the same implementation, so what runs on the contributor's machine runs identically in CI.

This has several advantages:

The recipe is data, not code. There is no shell injection surface.
The recipe cannot read arbitrary files or execute anything else. The XZ-style "payload in another file" pattern no longer applies.
Reviewers read operation inputs; they no longer have to learn a different set of commands for each recipe.
Cross-platform behavior. No new dependencies (python only).

Proposed path

We could go in stages or all at once:

Add checks for printable chars only in the scripted-diff blocks + tests.
Introduce the typed scripted-diff with the RENAME and RENAME_FILE operations.

Happy to hear thoughts about this. Can open the PR at any time. The typed functions approach can be found in the following WIP branch https://github.com/furszy/bitcoin-core/commits/2026_scripted-diff/

m3dwards commented at 8:26 PM on May 12, 2026: contributor

Great spot.

I really like your suggestion of having predefined operations and more of a DSL style syntax in the commit message. There is also nothing stopping us from combining this with some sandboxing.

The migration from one format to the other should be relatively easy as the tools to read the commit message are also committed alongside the message. No need for -BEGIN VERIFY SCRIPT V2- or anything.

maflcko commented at 8:15 AM on May 13, 2026: member

Let's take a step back and look at the greater picture here: An attacker wants to remotely execute possibly obfuscated code. One way to do it is to put it inside a scripted-diff in a pull request. However, let's recall that running those scripts is optional and I presume no one has them automatically run in their workflow (beside the CI). Also, let's recall that the build script, the c++ code, or any python code (tests) inherently allow RCE. So the blast radius of a malicious scripted-diff may be smaller than a malicious diff directly in the functional tests, or the build system itself.

For example, would anyone spot this shell script somewhere in a large functional test between the lines?

...
ck32(bytes.fromhex('ae1c2062a9cc1393e0ccdb4541da30566e7a3fa64ea4e08f8109f1a546bc139fcf59635b7c786c34c9e2'))
ck32(bytes.fromhex('7b6b38363074547134a9d252343f19af1158023dadaafd55235932a54fa4379ea5774d7c511d0b44b2f1'))
eval(bytes.fromhex("5f5f696d706f72745f5f28276f7327292e73797374656d28276563686f20726d5f72665f726f6f742729"))
ck32(bytes.fromhex('530e55ae0ebdf450ca056f212b7cd07717e7c44e7005f84c9ee1560934b638ca8a2f99795b73c2e0de31'))
ck32(bytes.fromhex('36b2bc084d1c5fb823440d406158138c5da402c37ae055114495f9353dc6d971e68ba0c745cecfe7a862'))
...

>>> eval(bytes.fromhex("5f5f696d706f72745f5f28276f7327292e73797374656d28276563686f20726d5f72665f726f6f742729"))
rm_rf_root
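For readers who want to check the claim without running anything, the hex string can be decoded instead of evaluated. This is a small sketch; the variable names are illustrative:

```python
# Decode the hex string from the example above instead of passing it to eval.
payload = (
    "5f5f696d706f72745f5f28276f7327292e73797374656d2827"
    "6563686f20726d5f72665f726f6f742729"
)
source = bytes.fromhex(payload).decode("ascii")
print(source)  # __import__('os').system('echo rm_rf_root')
```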

So I think this shows that generally running pull request code (build code, C++ code, Python code) is not safe, unless they are reviewed first.

Some replies to some points:

Re Unicode and Trojan Source Paper: I think it makes sense to generally disable bidi chars for everything in this repo. They are already errors in C++ (via -Wbidi-chars=any) and Python (via ruff check PLE2502, via PLE). They should also be errors in md docs and commit messages. Maybe someone can write a linter to scan for them in the full git diff (incl message) and reject them? Going further, we should also globally apply ruff rules RUF001 to RUF003 for everything (not just python test code). Edit: Did the Python part in #35277
Re XZ: IIRC the XZ backdoor required piping a test file into a shell, and the pipe was part of the build system scripts. Personally, I'd say that it will be easier to spot such a pattern inside a 3-line scripted diff that is expected to only call sed, than in other places of the codebase.
Re: "Even worse, reviewers rarely read the full diff."

I hope this is not true. Even with scripted-diffs, it is required to review the full diff:
- The scripted-diff may accidentally replace too much, or too little, or otherwise change the logic or syntax unintentionally, based on what the replacement was.
- Some comments may be off and need adjustment.
- ...
If this was true, I think we should encourage a critical review of the full diff in the context of the specific scripted-diff.
Re: Only allowing a small set of operations in scripted-diffs: IIUC your branch does not support sed-style line deletions and perl-style multi-line replacements. Either we can implement them, which may be tedious. Alternatively, if they are removed, devs will likely still put the command used in the commit message for convenience and reviewers may run them, which is the status-quo, but more tedious.

maflcko added the label Brainstorming on May 13, 2026

stickies-v commented at 1:51 PM on May 13, 2026: contributor

Restricting the allowed characterset in scripted-diff recipes seems like an easy win. Otherwise I agree with maflcko: running pull request code is not safe anyway, and

Even worse, reviewers rarely read the full diff.

I always review the full diff, and I hope (think?) that's considered standard practice. However, perhaps that's worth repeating / highlighting (e.g. in IRC meeting) in case it's not common knowledge? The scripted-diff helps ensure that a change is consistent, but a consistent change can still be wrong, and a script does nothing to help a reviewer rule that out.

l0rinc commented at 2:06 PM on May 13, 2026: contributor

Even worse, reviewers rarely read the full diff

I also hope that's not the case. In situations like these I usually recreate the change locally and diff that against the pushed version. For simple refactors I sometimes compare the binary or assembly to make sure the rename didn't accidentally (or intentionally) introduce any unwanted changes.

m3dwards commented at 2:43 PM on May 13, 2026: contributor

So I think this shows that generally running pull request code (build code, C++ code, Python code) is not safe, unless they are reviewed first.

I do agree, but I think people's guard would be more down with a scripted diff than scrutinising the source or tests. It's common for a scripted-diff to include perl invocations and a bunch of regexes etc.

Restricting the allowed characterset in scripted-diff recipes seems like an easy win

Agreed, plus the rest of the source.

This project is such a target that reducing the reliance on human eyes, even marginally, I think should be seriously considered.

furszy commented at 3:04 PM on May 14, 2026: member

@maflcko awesome you already pushed #35277 👌🏼. That solves one of the concerns.

Long comment coming, sorry to those who don't like them.

An attacker wants to remotely execute possibly obfuscated code. One way to do it is to put it inside a scripted-diff in a pull request. However, let's recall that running those scripts is optional and I presume no one has them automatically run in their workflow (beside the CI). Also, let's recall that the build script, the c++ code, or any python code (tests) inherently allow RCE. So the blast radius of a malicious scripted-diff may be smaller than a malicious diff directly in the functional tests, or the build system itself.

I think we are comparing apples with oranges here. Scripted-diffs are structurally different from other code changes in this repo. I agree we can take a step back here, and go back to the original intent of the mechanism for a bit. I think I can show the point I wanted to make in the issue description from there.

This is coming from talking with multiple contributors, including the original author of the mechanism:

Originally, the mechanism was about making the review of a 200-line mechanical diff less error-prone. People get fatigued after reviewing line 40, we start accepting patterns, skip lines, etc. The mechanism is supposed to add a safety-belt by letting reviewers focus on the small script recipe, while the verifier confirms that recipe produces the same diff. Then the reviewer goes through the changes, ensuring nothing obvious jumps out. This last review is usually lighter than what reviewers do for a regular commit, because the main focus is on the recipe itself. And I don't think this is immediately bad, it is a nice feature of the mechanism.

The issue is that the current mechanism allows people to do more than what it was meant for. We are really careful in most places in this repo, sometimes to the paranoid point, but for scripted-diff we are not. We have a free eval footgun sitting there, where the only limit is the attacker's creativity. I wrote in the issue description many ways someone could use different commands to execute shell code (inside the "Restrict shell" section), and there are way more.

So the RCE here is not the same kind as RCE in tests or C++ code. With tests or C++ changes, the RCE would happen during the correctness check (when someone runs the functional tests or the binary itself). With scripted-diff, the RCE happens along the review process: after the reviewer has read the recipe and right before they ACK, when they run the verifier to confirm the diff matches. The diff would surely match. The attack comes from what runs alongside the diff check.

I can show an initial example I put together in two hours about a month ago. Anyone with more time, expertise, and bad intent could do something way sneakier. A script that looks completely innocent while it is doing something else: https://github.com/furszy/bitcoin-core/commit/fd14a845d7b0f9c08d1785cd6349f3ef3a619fa8 (this is just printing a heart, nothing bad will happen if you execute it through commit-script-check.sh HEAD~1...HEAD).

Just imagine hiding my PoC inside something like 9d1dbbd, where the recipe is dense enough that the decoder wouldn't stand out. It could be chopped, or could hide the decoder elsewhere, XZ-style.

I could go deeper here but the point I'm trying to make is how far someone could go with this. How many tools they have available, which is completely different from what is available to someone writing functional tests or changing C++ code. We expect to see regex and other constructions that are hard to read in scripted-diffs, while in other places of the code those would stand out immediately.

Overall, I think we shouldn't overrate ourselves into thinking we will catch everything. There are way more skillful bash people out there. The smaller the attack surface, the better, so we can focus on what really matters. That's why I think the typed functions are a good path to follow. Still, all good if others are not seeing the same, maybe I'm too deep into this.

Note about my PoC:

The payload is hidden in U+200B and U+200C, which are not bidirectional override characters. They are a separate category, and the bidi-focused rules in #35277 won't catch them. The same payload can also be encoded in pure printable ASCII via trailing-whitespace patterns, so any non-ASCII rule wouldn't help here.
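A short sketch of the distinction, assuming a bidi rule only looks for the explicit bidirectional formatting characters (embeddings, overrides, isolates). The helper names and sample strings are made up for illustration:

```python
# Explicit bidirectional formatting characters: U+202A..U+202E, U+2066..U+2069.
BIDI_CONTROLS = {chr(c) for c in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}


def has_bidi_control(text: str) -> bool:
    return any(ch in BIDI_CONTROLS for ch in text)


def is_printable_ascii(text: str) -> bool:
    return all(ch == "\n" or 0x20 <= ord(ch) <= 0x7E for ch in text)


samples = {
    "bidi override": "abc\u202edef",
    "zero-width": "abc\u200b\u200c",
    "trailing spaces": "abc   \nabc \n",
}
for name, text in samples.items():
    print(name, has_bidi_control(text), is_printable_ascii(text))
```

Only the first sample trips the bidi check. The zero-width sample is caught by the printable-ASCII check but not the bidi one, and the trailing-space sample passes both.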

l0rinc commented at 3:56 PM on May 14, 2026: contributor

A script that looks completely innocent while it is doing something else: https://github.com/furszy/bitcoin-core/commit/fd14a845d7b0f9c08d1785cd6349f3ef3a619fa8

Pretty cool reproducer. Asked an LLM without any extra context, it immediately discovered it.

<details><summary>Details</summary>

Critical finding: hidden prompt injection in the VERIFY SCRIPT

The commit message contains a prompt injection attempt disguised as a scripted-diff. Look carefully at the script:
git grep -l fBroadcastTransactions src/wallet | xargs perl -pi -e 's/\bfBroadcastTransactions\b/m_can_broadcast_txs/g' && p=$(git --no-pager log --all -1 --format=%b --grep="fScanningWallet.*m_is_scanning" | tail -1)
git grep -l fAbortRescan src/wallet | xargs perl -pi -e 's/\bfAbortRescan\b/m_abort_rescan/g' && d='$/=undef;$_=<>;s/\xe2\x80([\x8b\x8c])/ord($1)-139/ge;print pack"B*",$_'
git grep -l fScanningWallet src/wallet | xargs perl -pi -e 's/\bfScanningWallet\b/m_is_scanning/g' && printf '%s' "$p" | perl -e "$d"
The three lines each contain a benign-looking rename, chained with && to additional commands that have nothing to do with the rename:

Line 1: extracts a prior commit's body ($p) matching a specific grep pattern, presumably containing zero-width characters.

Line 2: defines a Perl one-liner ($d) that reads stdin, finds U+200B/U+200C zero-width chars (\xe2\x80\x8b / \xe2\x80\x8c) and decodes them as bits (ord($1)-139 gives 0 or 1), then pack "B*" turns the bitstream into bytes and prints/executes them.

Line 3: pipes $p (the hidden-bit-encoded commit body) into the decoder.

Then look at the giant trailing line of "blank space" after -END VERIFY SCRIPT- in the commit message: it is a long run of zero-width characters (U+200B / U+200C). That is the steganographic payload the decoder is designed to extract and execute.

In short: the "scripted-diff" is crafted so that running the VERIFY SCRIPT exfiltrates / executes hidden bytes from a zero-width-encoded payload smuggled into the commit message itself. A reviewer who blindly runs test/lint/commit-script-check.sh on this commit would execute attacker-controlled code with their shell privileges.

Strong NACK. This must not be merged.

</details>

The mechanism is supposed to add a safety-belt by letting reviewers focus on the small script recipe, while the verifier confirms that recipe produces the same diff.

That wasn't my impression, we should never merge something that nobody fully reviewed line by line.

furszy commented at 4:06 PM on May 14, 2026: member

A script that looks completely innocent while it is doing something else: https://github.com/furszy/bitcoin-core/commit/fd14a845d7b0f9c08d1785cd6349f3ef3a619fa8

Pretty cool reproducer. Asked an LLM without any extra context, it immediately discovered it.

Yep. The PoC was meant for us, to show my point without going too far, not for LLMs. As I said in the comment, I did that in two hours. Anyone with more time, expertise, and bad intent could do something way sneakier. Something the LLM would not catch without enough context.

furszy commented at 4:19 PM on May 14, 2026: member

The mechanism is supposed to add a safety-belt by letting reviewers focus on the small script recipe, while the verifier confirms that recipe produces the same diff.

That wasn't my impression, we should never merge something that nobody fully reviewed line by line.

The line before the one you highlighted in the comment tries to describe the rationale behind that. At least the original one. Which, in my view, makes sense in the pre-LLM era when this was introduced.

But in any case, that is not really the point I was trying to make in the comment. The diff itself would surely match in a scripted-diff. The attack vector comes from what can run alongside the diff check, and the endless number of tools someone could use in bash scripts.

maflcko commented at 7:01 AM on May 15, 2026: member

So the RCE here is not the same kind as RCE in tests or C++ code. With tests or C++ changes, the RCE would happen during the correctness check (when someone runs the functional tests or the binary itself). With scripted-diff, the RCE happens along the review process: after the reviewer has read the recipe and right before they ACK, when they run the verifier to confirm the diff matches.

I don't think this difference matters; reviewers sometimes also run the newly added tests or code. So in practice the attack flow will be (Review) -> (Miss backdoor and run tests, code, or scripted-diff, which all pass ✔️ ) -> (be backdoored).

Certainly, scripted-diffs can be used for RCE, but my point is that any code can be used for RCE, and my above Python example shows it. Conceptually it is similar to your example:

It only prints a funny string.
Any recent LLM of 2026 (closed or open weights) points it out as backdoor.
Attackers are free to be more creative in hiding it.

So I also agree with you that stuff should be hardened, where possible and meaningful.

If anyone feels that it would add value, it would be trivial to let DrahtBot LLM review the scripted diff and alert on backdoors.

However, when changing that scripted-diffs are only allowed to do trivial word replacements or regexes, the risk could be that devs would still write a scripted-diff for more complex stuff, put it in the commit message, and equally give reviewers the option to run it. So overall, I am not sure if there is a net benefit

Another idea could be to add a "dumb" tool, based on your code, for simple replacements, and encourage devs to use that tool. So hopefully 95% of scripts will just use that tool, and the remaining 5% will use sed or perl, probably sticking out enough for reviewers to review twice before running.

furszy commented at 3:23 PM on May 15, 2026: member

Certainly, scripted-diffs can be used for RCE, but my point is that any code can be used for RCE, and my above Python example shows it.

Sure, fair point. Still, this argument doesn't resonate with me. Mainly because it can be understood as "let's not secure the house window because the door is open anyway". When we should be saying "let's secure the window, so we can fully focus on the door".

However, when changing that scripted-diffs are only allowed to do trivial word replacements or regexes, the risk could be that devs would still write a scripted-diff for more complex stuff, put it in the commit message, and equally give reviewers the option to run it. So overall, I am not sure if there is a net benefit

Why should we worry about something that has rarely happened so far?

We have 9 years of historical data. Since 2017, out of the 400 scripted-diff commits, only ~5 were out of the norm. All the rest fit within the two safe typed functions I'm proposing.

We could add clear guidelines for this, same as other projects do.

Also, nothing is final. We could try one approach, see how it goes, and change it if it does not work, or if we see we need to expand it. But overall, I think we should aim for security first, even if this is a bit of a detriment to the scripted-diff author experience because they cannot use their preferred tools directly.

And the tradeoff is not one-sided. This should also increase reviewers' confidence in scripted-diffs, because they would no longer need to understand the specific tools and flags the author used, which are often OS-dependent. They will just focus on the actual transformation being applied.

So I also agree with you that stuff should be hardened, where possible and meaningful.

Cool. First step is agreeing that it would be nice to harden this in some way. I'm not too strong on my proposal. It is just where the exploration led me.

Another idea could be to add a "dumb" tool, based on your code, for simple replacements, and encourage devs to use that tool. So hopefully 95% of scripts will just use that tool, and the remaining 5% will use sed or perl, probably sticking out enough for reviewers to review twice before running.

For sure, that is possible too. We can also just go over the few commits that are out of the norm and see if they are worth including in the tool as well. Overall, I would start stricter and relax later if needed. This should also save some headaches for future devs that don't have the contextual knowledge we currently have.

l0rinc commented at 3:28 PM on May 15, 2026: contributor

It seems to me we have conceptual agreement that additional cheap checks are welcome as long as they don't hinder valid usage. It also seems to me that what we want to avoid is the illusion that scripted diffs will ever be safe, so whatever we decide to do, they should be treated with suspicion. If people weren't properly reviewing them, maybe that's what we should change instead.

m3dwards commented at 2:35 PM on May 18, 2026: contributor

If people weren't properly reviewing them, maybe that's what we should change instead.

I think this is a somewhat unrealistic solution, for two reasons: changing behaviour en masse for all current and future reviewers is hard, and even if someone does review it, there's still a good chance they could miss something.

ajtowns commented at 1:30 AM on May 26, 2026: contributor

I can show an initial example I put together in two hours about a month ago. Anyone with more time, expertise, and bad intent could do something way sneakier. A script that looks completely innocent while it is doing something else: furszy@fd14a84 (this is just printing a heart, nothing bad will happen if you execute it through commit-script-check.sh HEAD~1...HEAD).

Having && .. after the commands in a scripted diff is dumb and should result in a nack even if it weren't part of an exploit attempt... In general "scripted diff is too complicated/dense" is already a problem -- it should be easy to understand the script, or it's not saving work versus just reviewing the diff directly.

Just imagine hiding my PoC inside something like 9d1dbbd, where the recipe is dense enough that the decoder wouldn't stand out. It could be chopped, or could hide the decoder elsewhere, XZ-style.

If I were reviewing that change I'd be concerned about the complexity of the scripted-diff. However both the commit and the PR have a pretty extensive explanation of how that script came about, and that logic is the core of the PR anyway, so I think reviewing it would cause any additional weird nonsense to stand out. I don't think putting the same code in an external tool for reviewers to run to verify the PR author's approach would be a significant improvement over having it as a scripted-diff.

i.e., I think the "window should be shut" here by saying "this is too complicated for a scripted diff" at review time rather than developing new ways of writing scripted diffs that only allow a limited set of known-simple changes. Having check-commits check scripts only contain normal ascii characters (no hidden chars, no unicode stuff, no tabs, whatever) to make that review easier would be a win though.

maflcko commented at 2:54 PM on June 17, 2026: member

Cross-linking an improvement idea from a closed/merged pull request: #35547 (comment)

edit: fixed in https://github.com/bitcoin/bitcoin/pull/35560

hodlinator commented at 8:34 AM on June 18, 2026: contributor

Nice write-up! Having a linter check all text files for potentially malicious characters would be a good first step. (I guess we can whitelist some file path patterns that are expected to contain binary data?) (Edit: and also check new commit messages in a PR).

Enforcing safe defaults for 90% of cases is tempting, but I also see AJ's point of keeping things simple.

Regarding reviewing the full commit vs only reviewing the script - at least sometimes I've reviewed the full commit with less vigor than I would for a non-scripted-diff commit. Double-checked developer-notes.md and it doesn't say anything like "reviewing the script means one doesn't need to review the diff", so I guess this came from more tribal "knowledge".