Using LLMs to find Python C-extension bugs

Jake Edge, LWN.net:

[…] Hobbyist Daniel Diniz used Claude Code to find more than 500 bugs of various sorts across nearly a million lines of code in 44 extensions; he has been working with maintainers to get fixes upstream and his methodology serves as a great example of how to keep the human in the loop—and the maintainers out of burnout—when employing LLMs.

It's worth reading Daniel Diniz's post on the Python forums in full. It shows an engineer with specific domain expertise using LLMs to augment and amplify his abilities, and, just as importantly, working closely with maintainers to make sure he isn't inundating them with slop PRs or unreproducible bug reports.

The part I find most interesting is how Daniel's Claude Code plugin works. He writes in his forum post:

I built a Claude Code plugin called cext-review-toolkit. The key difference from traditional static analysis is that this system tracks Python-specific invariants (refcounts, GIL discipline, exception state) across control flow, and validates findings with targeted reproducers. That is done by 13 specialized analysis agents analyzing the C extension source code in parallel, with each agent targeting a different bug class.

The agents use Tree-sitter for C/C++ parsing, which enables analysis that pattern matching can’t do, like tracking borrowed reference lifetimes across function calls, or cross-referencing type slot definitions with struct members.

Each agent can run a scanner script to find candidates, then performs qualitative review of each candidate to confirm or dismiss it. The scripts have a ~20-40% false positive rate and the agents are there to bring that down. After the agents finish, I try to reproduce every finding from pure Python and write a reproducer appendix.
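
To make the refcount tracking concrete, here's a hypothetical extension function (the names are mine, not from the toolkit) with exactly the flow-sensitive bug class described above. A textual pattern matcher sees a Py_DECREF paired with the PyObject_GetAttrString and is satisfied; only analysis that follows each control-flow path notices the leak:

    #include <Python.h>

    /* `name` holds a new reference. One error path forgets to release
     * it, so the leak exists only along that path. */
    static PyObject *
    get_name_upper(PyObject *obj)
    {
        PyObject *name = PyObject_GetAttrString(obj, "name");  /* new ref */
        if (name == NULL)
            return NULL;

        PyObject *upper = PyObject_CallMethod(name, "upper", NULL);
        if (upper == NULL)
            return NULL;        /* BUG: `name` leaks on this path */

        Py_DECREF(name);        /* released only on the success path */
        return upper;
    }

The fix is a Py_DECREF(name) before the early return, and a pure-Python reproducer for a finding like this can be as simple as calling the function in a loop with a failing upper() and watching memory grow.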

Later from the same post:

Traditional tools like clang-tidy, Coverity, and sanitizers struggle with Python C API semantics (reference ownership, exception state, GIL constraints). The analyses cext-review-toolkit performs target those invariants specifically. Besides that, the tool uses guided semantic analysis (LLM-assisted) to analyze aspects like “was that bugfix complete, and do similar bugs still lurk in the codebase?” that other tools cannot cover.
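
Reference ownership alone shows why: two adjacent CPython API calls can follow opposite ownership conventions that exist only in the documentation, invisible to the C type system. A minimal sketch (the helper function is hypothetical):

    #include <Python.h>

    /* PyList_SetItem() steals the reference it's given (even on
     * failure), while PyList_GetItem() returns a borrowed reference.
     * Nothing in the C signatures distinguishes the two. */
    static PyObject *
    set_and_fetch_first(PyObject *list)
    {
        PyObject *val = PyLong_FromLong(42);        /* new reference */
        if (val == NULL)
            return NULL;
        if (PyList_SetItem(list, 0, val) < 0)       /* steals val... */
            return NULL;                            /* ...so no Py_DECREF here */

        PyObject *item = PyList_GetItem(list, 0);   /* borrowed reference */
        if (item == NULL)
            return NULL;
        Py_INCREF(item);        /* take ownership before returning it */
        return item;
    }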

The rich set of agents covers:

  • Reference counting: leaked refs, borrowed-ref-across-callback, stolen-ref misuse.
  • Error handling: missing NULL checks, return without exception, exception clobbering.
  • NULL safety: unchecked allocations, dereference-before-check.
  • GIL discipline: API calls without GIL, blocking with GIL held.
  • Type slots: dealloc bugs, missing traverse/clear, __new__-without-__init__ safety.
  • PyErr_Clear: unguarded exception swallowing (MemoryError, KeyboardInterrupt).
  • Module state: single-phase init, global PyObject* state.
  • Version compatibility: deprecated APIs, dead version guards.
  • Git history: fix completeness (same bug fixed in one place but not another).
  • Plus: stable ABI compliance, resource lifecycle, complexity analysis.
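
To pick one class from that list: the PyErr_Clear item is about code that means to treat a missing attribute as a soft failure but, with a bare PyErr_Clear(), would also silently swallow MemoryError or KeyboardInterrupt. A hypothetical guarded version looks like this:

    #include <Python.h>

    /* Return obj.payload, or None if the attribute doesn't exist.
     * The guard clears only the expected AttributeError and lets
     * anything else propagate. */
    static PyObject *
    get_payload_or_none(PyObject *obj)
    {
        PyObject *v = PyObject_GetAttrString(obj, "payload");
        if (v == NULL) {
            if (!PyErr_ExceptionMatches(PyExc_AttributeError))
                return NULL;    /* MemoryError etc. propagate */
            PyErr_Clear();      /* swallow only the expected error */
            Py_RETURN_NONE;
        }
        return v;
    }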

So cext-review-toolkit is not just a set of prompts that tell Claude to go find bugs. It combines detailed descriptions of specific bug classes with Tree-sitter-powered scripts that let Claude extract rich semantic data from the codebase it's analyzing. The LLM is not doing all of the heavy lifting here; it works in tandem with human expertise encoded in prompts and in deterministic scripts custom-built to act on those prompts.
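
I haven't read the toolkit's scripts, so this is only a guess at their shape, but Tree-sitter makes such scanners pleasantly small. Here's a minimal sketch using Tree-sitter's C API that parses a snippet of extension code and lists every function call, the raw material a bug-class-specific query would filter (the snippet and query are my own, purely for illustration; build it against libtree-sitter plus the tree-sitter-c grammar):

    #include <stdio.h>
    #include <string.h>
    #include <tree_sitter/api.h>

    /* Provided by the tree-sitter-c grammar library. */
    const TSLanguage *tree_sitter_c(void);

    int main(void) {
        TSParser *parser = ts_parser_new();
        ts_parser_set_language(parser, tree_sitter_c());

        const char *source =
            "PyObject *f(PyObject *o) {\n"
            "    PyObject *v = PyObject_GetAttrString(o, \"x\");\n"
            "    return v;\n"
            "}\n";
        TSTree *tree = ts_parser_parse_string(
            parser, NULL, source, (uint32_t)strlen(source));

        /* A query over the syntax tree: capture every call's callee. */
        const char *pattern =
            "(call_expression function: (identifier) @callee)";
        uint32_t err_offset;
        TSQueryError err_type;
        TSQuery *query = ts_query_new(tree_sitter_c(), pattern,
                                      (uint32_t)strlen(pattern),
                                      &err_offset, &err_type);

        TSQueryCursor *cursor = ts_query_cursor_new();
        ts_query_cursor_exec(cursor, query, ts_tree_root_node(tree));

        TSQueryMatch match;
        while (ts_query_cursor_next_match(cursor, &match)) {
            for (uint16_t i = 0; i < match.capture_count; i++) {
                TSNode node = match.captures[i].node;
                uint32_t start = ts_node_start_byte(node);
                uint32_t end = ts_node_end_byte(node);
                printf("call to %.*s\n", (int)(end - start), source + start);
            }
        }

        ts_query_cursor_delete(cursor);
        ts_query_delete(query);
        ts_tree_delete(tree);
        ts_parser_delete(parser);
        return 0;
    }

A real scanner would run richer queries (pairing Py_INCREF/Py_DECREF sites per variable, say), emit candidates, and leave the judgment call to an agent, which is exactly the division of labor described above.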

To me, this feels like the most effective use of LLMs for domain-specific tasks that are barely represented in training data: encode as much of your logic as you can into deterministic tools, encode the squishier parts of your domain into prompts, and let an agent drive those tools.

I can see a possible future where every project has its own version of cext-review-toolkit, one that encodes the classes of bugs the project deals with repeatedly. How much would something like this improve code quality? And how much better would it be than the generic PR review agents we use today?