LLMs Spot Subtle Linux Kernel Bugs Through Code Alone


Large language models are beginning to demonstrate tangible utility in complex vulnerability research workflows. A recent experiment showed that OpenAI’s o3 model was able to identify CVE-2025-37899—a previously unknown, remotely exploitable use-after-free vulnerability in the Linux kernel’s ksmbd subsystem—purely through static analysis of source code. The discovery was made by feeding o3 the raw C code for the SMB2 logoff handler, along with supporting context, and asking it to identify use-after-free bugs. There were no plug-ins, frameworks, or external tools involved—just o3 and a well-constructed prompt.

The vulnerability itself arises in a scenario where multiple worker threads serve concurrent SMB requests over different connections bound to the same session. The SMB3 protocol allows this behavior, which means different threads may share access to the same ksmbd_session and, crucially, to its user field. The problem occurs when one thread processes a LOGOFF request and frees sess->user, while another thread is still using it. Because no reference counting or synchronization protects the shared user object, this leads to a classic use-after-free condition.

The sequence of events looks like this:

  1. A second connection is bound to an already-authenticated session (conn->binding == true).
  2. Worker-A handles a request (e.g. a WRITE) and retains a pointer to sess->user, without incrementing any refcount.
  3. Concurrently, Worker-B receives a LOGOFF command and frees sess->user, followed by setting it to NULL.
  4. Worker-A then uses the user object. Depending on timing, it either dereferences the stale pointer to freed memory (risking memory corruption) or reads the field after it has been cleared and triggers a kernel NULL pointer dereference.
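
The steps above can be replayed deterministically in user space. The following is a minimal sketch, not kernel code: struct user, struct session, and the worker_* and replay_* helpers are hypothetical stand-ins, and the free is modeled with a flag so the example never actually touches freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the ksmbd structures; not kernel code. */
struct user {
    bool freed;          /* models the free without really releasing memory */
    int uid;
};

struct session {
    struct user *user;   /* shared by every connection bound to the session */
};

/* Step 2: Worker-A loads the pointer; nothing pins the object. */
static struct user *worker_a_begin(struct session *sess)
{
    assert(sess != NULL);
    return sess->user;
}

/* Step 3: Worker-B handles LOGOFF: free the user, then clear the field. */
static void worker_b_logoff(struct session *sess)
{
    assert(sess != NULL);
    if (sess->user != NULL) {
        sess->user->freed = true;   /* stands in for the real free */
        sess->user = NULL;
    }
}

/* Step 4, early read: Worker-A cached the pointer before the LOGOFF.
 * Returns true when its later access would touch freed memory. */
bool replay_stale_pointer(void)
{
    struct user u = { .freed = false, .uid = 1000 };
    struct session sess = { .user = &u };

    struct user *cached = worker_a_begin(&sess);
    worker_b_logoff(&sess);
    return cached != NULL && cached->freed;
}

/* Step 4, late read: Worker-A loads the field only after the LOGOFF.
 * Returns true when it would dereference NULL. */
bool replay_null_field(void)
{
    struct user u = { .freed = false, .uid = 1000 };
    struct session sess = { .user = &u };

    worker_b_logoff(&sess);
    return worker_a_begin(&sess) == NULL;
}
```

Both replay functions return true, one for each outcome in step 4. In the real bug the interleaving is decided by thread scheduling rather than by call order, which is why the failure depends on timing.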

The discovery is notable not only for its technical subtlety (an inter-thread race on an object with no synchronization) but also because o3 surfaced it without being pointed at the specific bug: the prompt asked only for use-after-free bugs in general. The model reasoned about the interplay between session binding, asynchronous request handling, and the lifetime of shared data structures.

In a related benchmark, the author also evaluated o3 against CVE-2025-37778, another use-after-free vulnerability, this one in the Kerberos authentication path of ksmbd. There, freeing sess->user was mistakenly assumed to be safe, even though some code paths can still reach it after the free. Given roughly 3.3k lines of code as input, o3 identified this vulnerability in 8 out of 100 experimental runs. The signal-to-noise ratio remains a challenge: o3 produced 28 false positives and 66 false negatives in the same experiment. The correct reports, however, were accurate and well articulated.
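
To make the signal-to-noise figures concrete, the quoted tallies can be turned into two simple ratios. This is an illustrative sketch only: struct run_tally and both helpers are hypothetical names, and the metrics are a reader's way of summarizing the numbers, not something the write-up is claimed to compute.

```c
#include <assert.h>

/* Tallies as quoted in the text; the struct and helpers are illustrative. */
struct run_tally {
    int runs;        /* total experimental runs */
    int true_pos;    /* runs that reported the real bug */
    int false_pos;   /* runs that reported a bug that is not there */
};

/* Fraction of runs that reported the real bug. */
double hit_rate(struct run_tally t)
{
    assert(t.runs > 0);
    return (double)t.true_pos / t.runs;
}

/* Share of positive reports that were correct (precision). */
double report_precision(struct run_tally t)
{
    assert(t.true_pos + t.false_pos > 0);
    return (double)t.true_pos / (t.true_pos + t.false_pos);
}
```

With 100 runs, 8 correct reports, and 28 false positives, the hit rate is 8% and the precision is about 22%, or one correct report for every 3.5 false ones.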

Interestingly, the model demonstrated an understanding of the mitigation implications too. In some outputs, it correctly identified that setting sess->user = NULL is not a complete fix, because session binding lets another connection reach the object after it has been freed. This kind of nuanced reasoning is rare among automated analysis tools.
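
One general way to close this kind of window is to pin the shared object with a reference count, so that a LOGOFF only drops the session's reference and the memory is released when the last user lets go. The sketch below illustrates that idea in user-space C11. It is an assumption-laden example, not the upstream ksmbd patch: user_alloc, user_get, user_put, and replay_with_refcount are hypothetical names. It also replays the interleaving on a single thread; in real concurrent code, loading sess->user and taking the reference would themselves have to be serialized against the LOGOFF path (for example under a session lock).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model; not the ksmbd structures. */
struct user {
    atomic_int refs;
    int uid;
};

struct session {
    struct user *user;
};

/* Allocate a user holding one reference, owned by the session. */
struct user *user_alloc(int uid)
{
    struct user *u = malloc(sizeof(*u));
    if (u == NULL)
        return NULL;
    atomic_init(&u->refs, 1);
    u->uid = uid;
    return u;
}

/* Pin the object for the duration of a request. */
struct user *user_get(struct user *u)
{
    if (u != NULL)
        atomic_fetch_add(&u->refs, 1);
    return u;
}

/* Drop one reference; frees and returns true when it was the last one. */
bool user_put(struct user *u)
{
    if (u != NULL && atomic_fetch_sub(&u->refs, 1) == 1) {
        free(u);
        return true;
    }
    return false;
}

/* Replays the LOGOFF interleaving with pinning.
 * Returns the uid Worker-A reads, or -1 if the lifetime rules were violated. */
int replay_with_refcount(void)
{
    struct session sess = { .user = user_alloc(1000) };
    if (sess.user == NULL)
        return -1;

    struct user *pinned = user_get(sess.user);   /* Worker-A: refs 1 -> 2 */

    struct user *old = sess.user;                /* Worker-B: LOGOFF */
    sess.user = NULL;
    bool freed_by_logoff = user_put(old);        /* refs 2 -> 1, not freed */

    int uid = pinned->uid;                       /* object is still alive */
    bool freed_by_worker = user_put(pinned);     /* refs 1 -> 0, freed here */

    return (!freed_by_logoff && freed_by_worker) ? uid : -1;
}
```

Here the LOGOFF clears the session field but does not release the memory, so Worker-A can finish its request and the object is freed only when Worker-A drops its own reference.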

Scaling the test to a broader code context, by feeding o3 the entire smb2pdu.c file (12k LoC, ~100k tokens), did degrade performance: the Kerberos vulnerability was rediscovered only once. But it was in that broader context that o3 found CVE-2025-37899, which had not previously been reported.

As the author notes in his write-up, LLMs like o3 are now occupying a middle ground in program analysis tools—less precise than symbolic execution, less brute-force than fuzzing, but more creative and flexible than either. While they aren’t perfect, their signal is now strong enough to warrant serious integration into vulnerability research pipelines.