When an AI Writes the Math Paper

The FrontierMath: Open Problems benchmark has a strict criterion for inclusion: only problems with no known solution, where a correct answer would constitute a publishable result in its own right. This week, Epoch confirmed the first solve.

The problem — a Ramsey-style challenge on hypergraphs — asked for large hypergraphs avoiding a certain partition property, specifically for improving lower bounds on the sequence H(n) studied by Will Brian and Paul Larson. It’s a combinatorics problem rooted in Ramsey theory and infinite series convergence, and while the setup is describable in a paragraph, finding better constructions had resisted effort for years. GPT-5.4 Pro, guided by Kevin Barreto and Liam Price using a prompting workflow they’d built up through work on Erdős-type problems, produced a construction establishing H(n) ≥ (26/25)·k_n for n ≥ 15.

Brian’s response is worth quoting directly: “This is an exciting solution to a problem I find very interesting. I had previously wondered if the AI’s approach might be possible, but it seemed hard to work out. Now I see that it works out perfectly. It eliminates an inefficiency in our lower-bound construction and in some sense mirrors the intricacy of our upper-bound construction.” He plans to write it up for publication, potentially listing the prompting collaborators as coauthors.

A few things are worth being precise about here, because the framing matters. The solution is a Python program that constructs the hypergraphs, closer to computational combinatorics than to a traditional pen-and-paper proof. Commenters on HN noted this and flagged the problem as possibly “low-hanging fruit” within the open problems set. The run consumed around 250k tokens to reach the answer, a lot of compute for one construction. And when Epoch subsequently ran their own evaluation scaffold, they found that Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh) could also solve it with appropriate prompting.
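The construction-plus-checker shape of such a result is easy to illustrate on a toy case. The sketch below has nothing to do with the Brian–Larson problem itself; it only shows the genre: a short program builds a combinatorial object (here, a 2-coloring of the edges of K5 with no monochromatic triangle, which certifies the classical bound R(3,3) > 5) and a brute-force checker verifies the claimed property.

```python
from itertools import combinations

def color(i: int, j: int) -> str:
    # Edges of the 5-cycle (distance 1 or 4 mod 5) are "red";
    # the pentagram chords (distance 2 or 3) are "blue".
    return "red" if (i - j) % 5 in (1, 4) else "blue"

def has_mono_triangle(n: int) -> bool:
    # Brute-force check: does any triangle get a single color?
    return any(
        color(a, b) == color(b, c) == color(a, c)
        for a, b, c in combinations(range(n), 3)
    )

# The construction succeeds on 5 vertices, certifying R(3,3) > 5.
assert not has_mono_triangle(5)
```

The checker is the important part: the result is trusted because the property is mechanically verified, not because the program that produced the construction is trusted.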

The multi-model result is more significant than it sounds. A problem that four current frontier models can solve when properly prompted sat at the edge of current-generation capability broadly: evidence of a capability regime, not a single outlier or a one-off fluke. Whether that regime extends to harder problems in the open set is the more interesting question, and one that is being actively tested.

The broader FrontierMath context: the benchmark went from near-zero accuracy in late 2024 to 50% on Tiers 1–3 for GPT-5.4 Pro in about 16 months. That trajectory makes it genuinely hard to reason from current numbers about which problems will still look hard in a year.

On the infrastructure side, Cloudflare published a post on March 19 that deserves attention if you’re thinking about the economics of high-volume agentic workloads. They added Kimi K2.5 to Workers AI — Moonshot AI’s 256k-context open-weight model — and reported a 77% cost reduction for an internal security review agent that processes 7 billion tokens daily. The projected savings: $2.4M/year compared to a comparable proprietary model. To get there they built custom inference kernels using disaggregated prefill (separating the prefill and generation stages across different machines) on their Infire engine, plus prefix caching with session affinity headers to improve cache hit rates in multi-turn conversations. It’s a real production workload, not a benchmark-optimized demonstration.
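The reported figures can be sanity-checked with a little arithmetic. The sketch below assumes the $2.4M is the annual savings and the 77% is a reduction relative to the prior cost (a plausible reading, but the post's exact accounting is an assumption), and derives the implied blended per-million-token prices:

```python
# Back-of-envelope check of Cloudflare's reported figures.
# Assumptions: 365 days/year; the 77% reduction is relative to the prior cost.
tokens_per_year = 7e9 * 365          # 7 billion tokens per day
savings = 2.4e6                      # reported $2.4M/year saved
reduction = 0.77

prior_cost = savings / reduction     # implied annual cost before the switch
new_cost = prior_cost - savings      # implied annual cost after

mtok = tokens_per_year / 1e6         # millions of tokens per year
prior_per_mtok = prior_cost / mtok
new_per_mtok = new_cost / mtok

print(f"implied blended price: ${prior_per_mtok:.2f}/Mtok -> ${new_per_mtok:.2f}/Mtok")
# → implied blended price: $1.22/Mtok -> $0.28/Mtok
```

Those implied prices land in a plausible range for, respectively, a proprietary API and an open-weight model on managed inference, which is some evidence the headline numbers are internally consistent.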

The Bonk code review agent — open-source on GitHub, built on OpenCode, responding to /bonk mentions in GitHub PRs — is the user-facing artifact from that work. But the more interesting story is what Cloudflare’s numbers reveal about where open-weight models on managed inference now sit competitively against proprietary APIs for sustained, token-heavy workloads.

Mozilla’s AI team also shipped cq on March 23, an MCP server that functions as a shared knowledge layer for agents. The premise: agents running against the same codebase repeatedly waste tokens rediscovering the same things. cq lets agents store what they learn so that future agents can query it, building confidence in solutions through repeated validation. It’s a proof-of-concept for now, but the problem is real: agents are currently stateless across runs in ways that humans working in a codebase are not.
