Cheddar Bench: Coding Agents Playing Bug Treasure Hunt

Let’s talk about Cheddar Bench—a clever unsupervised benchmark that turns bug detection into a treasure hunt for CLI coding agents. Agents take on dual roles: challengers sneakily plant bugs in code repositories (logging each one in a bugs.json manifest), while reviewer agents work to uncover them. The magic happens without human intervention, thanks to an LLM matcher that scores the reviewers’ findings against the planted bugs.
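To make the scoring loop concrete, here is a minimal sketch of how a detection rate could be computed from a manifest and matched findings. The field names (`id`, `file`, `description`, `matched_bug_id`) and the `detection_rate` helper are assumptions for illustration; only the bugs.json filename comes from the benchmark itself, and the real schema may differ.

```python
import json

# Hypothetical shape of a bugs.json manifest (field names are assumptions;
# the real schema may differ). Each entry records one planted bug.
manifest = [
    {"id": 1, "file": "src/parser.py", "description": "off-by-one in loop bound"},
    {"id": 2, "file": "src/cache.py", "description": "stale key never evicted"},
]

# Reviewer findings after the LLM matcher has judged them (assumed structure):
# each finding either maps to a planted bug id or is left unmatched (None).
findings = [{"matched_bug_id": 1}, {"matched_bug_id": None}]

def detection_rate(manifest, findings):
    """Fraction of planted bugs that at least one finding matched."""
    caught = {f["matched_bug_id"] for f in findings} - {None}
    return len(caught) / len(manifest)

print(f"{detection_rate(manifest, findings):.2%}")  # 50.00%
```

In the real benchmark the matching step is itself an LLM call rather than an exact id lookup, which is what lets the whole pipeline run unsupervised.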

This quirky setup isn’t just a test of skill; it’s an innovation playground. Each tool (like Claude Code, Codex CLI, Gemini CLI) brings its own strategies to the challenge, showcasing its unique capabilities. The results? Out of 2,603 injected bugs, Claude leads the pack, catching 58.05% of those sneaky errors, with Codex and Gemini trailing behind.

Why should we care? This benchmark not only tests these CLI agents in a dynamic, unsupervised environment but also pushes the boundaries of AI-driven bug detection. Imagine applying such efficient automation to real-world programming, accelerating debugging and improving software quality far beyond current capabilities. Cheddar Bench is truly a clever twist in the world of coding AI tools, opening the door to fascinating future applications.
Read more at GitHub…