Cheddar Bench: Coding Agents Playing Bug Treasure Hunt

Let’s talk about Cheddar Bench—a clever unsupervised benchmark that turns bug detection into a treasure hunt for CLI coding agents. Agents take on dual roles: challengers sneakily plant bugs in code repositories (logging each one in a bugs.json manifest), while reviewer agents work to uncover them. The magic happens without human intervention, thanks to an LLM matcher that scores the reviewers’ findings against the planted bugs.
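To make the scoring loop concrete, here is a minimal sketch of how a detection rate could be computed from a manifest and matched findings. The field names (`id`, `file`, `description`, `matched_bug_id`) and the `detection_rate` helper are assumptions for illustration; only the bugs.json filename comes from the benchmark itself, and the real schema may differ.

```python
import json

# Hypothetical shape of a bugs.json manifest (field names are assumptions;
# the real schema may differ). Each entry records one planted bug.
manifest = [
    {"id": 1, "file": "src/parser.py", "description": "off-by-one in loop bound"},
    {"id": 2, "file": "src/cache.py", "description": "stale key never evicted"},
]

# Reviewer findings after the LLM matcher has judged them (assumed structure):
# each finding either maps to a planted bug id or is left unmatched (None).
findings = [{"matched_bug_id": 1}, {"matched_bug_id": None}]

def detection_rate(manifest, findings):
    """Fraction of planted bugs that at least one finding matched."""
    caught = {f["matched_bug_id"] for f in findings} - {None}
    return len(caught) / len(manifest)

print(f"{detection_rate(manifest, findings):.2%}")  # 50.00%
```

In the real benchmark the matching step is itself an LLM call rather than an exact id lookup, which is what lets the whole pipeline run unsupervised.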

This quirky setup isn’t just a test of skill; it’s an innovation playground. Each tool (like Claude Code, Codex CLI, Gemini CLI) brings its own strategies to the challenge, showcasing its unique capabilities. The results? Out of 2,603 injected bugs, Claude leads the pack, catching 58.05% of those sneaky errors, with Codex and Gemini trailing behind.

Why should we care? This benchmark not only tests these CLI agents in a dynamic, unsupervised environment but also pushes the boundaries of AI-driven bug detection. Imagine applying such efficient automation to real-world programming, accelerating debugging and improving software quality far beyond current capabilities. Cheddar Bench is truly a clever twist in the world of coding AI tools, opening the door to fascinating future applications.
Read more at GitHub…