A new framework reveals that some leading AI models may be getting significant artificial score boosts from accidentally studying the answers beforehand
Picture this: You’re comparing two students’ test scores to see who’s smarter. Student A gets 95%, Student B gets 78%. Case closed, right? But what if you later discovered that Student A had somehow seen most of the test questions beforehand, while Student B went in completely blind? Suddenly, that comparison doesn’t look so fair.
This exact scenario is playing out in the AI world, and it’s a bigger problem than most people realize. A new research paper submitted to arXiv in July 2025 introduces ArxivRoll, a system that detects when AI models are getting inflated scores from accidentally training on test data. The results reveal varying degrees of contamination across model families: some show significant overestimation, while others maintain more honest performance profiles.
The Homework Problem in AI Evaluation
Here’s the thing about evaluating AI models: we need standardized tests to compare them, just like SATs for college admissions. But there’s a catch. These benchmark tests get published publicly, and then something predictable happens – they eventually leak into the massive datasets used to train new AI models. It’s like having tomorrow’s exam questions accidentally mixed into today’s study materials.
This isn’t necessarily intentional gaming by AI companies. Training datasets are so enormous that it’s nearly impossible to manually scrub every benchmark question. But the effect is the same: inflated scores that make models look smarter than they actually are.
Enter ArxivRoll: The Anti-Cheating Detective
The ArxivRoll team came up with an elegant solution inspired by one-time pad encryption in cryptography. One-time pads are the gold standard of secret codes: imagine you and a friend each have identical notebooks filled with random numbers, and you use each page exactly once to encrypt a message, then tear it out and burn it. Even if someone intercepts your message, they can’t crack it because the “key” no longer exists. ArxivRoll applies this concept to AI testing by creating brand new benchmark tests every six months using fresh research papers from arXiv that no model could have seen during training.
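To make the freshness rule concrete, here is a minimal sketch – my illustration, not the authors’ pipeline – of the core selection step: only papers submitted after the current benchmark round opened are eligible for the private test pool. The Paper record and select_fresh_papers helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    arxiv_id: str
    title: str
    text: str
    submitted: date

def select_fresh_papers(papers: list[Paper], round_start: date) -> list[Paper]:
    # Anything submitted before the round opened could already sit in a
    # model's training data, so it is excluded from the private test pool.
    return [p for p in papers if p.submitted >= round_start]

# Toy example: a benchmark round built only from papers submitted in 2025
candidates = [
    Paper("2501.00001", "Fresh preprint", "...", date(2025, 1, 15)),
    Paper("2406.12345", "Older preprint", "...", date(2024, 6, 20)),
]
pool = select_fresh_papers(candidates, round_start=date(2025, 1, 1))
print([p.arxiv_id for p in pool])  # only the 2025 paper survives
```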
The system has two main components that together work like a sophisticated cheating detector:
SCP (Sequencing, Cloze, and Prediction) automatically generates test questions from recent papers. Think of it as a test-making robot that can read any research paper and instantly create three types of questions (a rough code sketch follows the list below):
– Sequencing: Takes sentences from a paper and jumbles them up – can the model put them back in the right order?
– Cloze: Hides sentences from a paper – can the model fill in the blanks?
– Prediction: Shows part of a passage – can the model predict what comes next?
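Here is that rough sketch of the three question types, assuming a paper has already been split into sentences. It’s my illustration of the idea, not the released ArxivRoll code, and the function names are made up.

```python
import random

def make_sequencing_task(sentences: list[str], seed: int = 0) -> dict:
    # Shuffle the passage; the model must recover the original order.
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)
    return {
        "type": "sequencing",
        "shuffled": [sentences[i] for i in order],
        "answer": order,  # shuffled position j came from original index order[j]
    }

def make_cloze_task(sentences: list[str], blank_idx: int) -> dict:
    # Hide one sentence; the model must fill in the blank.
    masked = list(sentences)
    answer = masked[blank_idx]
    masked[blank_idx] = "[BLANK]"
    return {"type": "cloze", "context": masked, "answer": answer}

def make_prediction_task(sentences: list[str], prefix_len: int) -> dict:
    # Show the beginning of the passage; the model must continue it.
    return {
        "type": "prediction",
        "prompt": sentences[:prefix_len],
        "answer": sentences[prefix_len:],
    }

# Toy example on a three-sentence "paper"
paper = [
    "We introduce a contamination detector for language models.",
    "It builds private benchmarks from recent preprints.",
    "Contaminated models score far better on public test sets.",
]
tasks = [
    make_sequencing_task(paper),
    make_cloze_task(paper, blank_idx=1),
    make_prediction_task(paper, prefix_len=2),
]
```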
Rugged Scores (RS) act like forensic investigators, comparing how models perform on public benchmarks versus these fresh private tests. The researchers describe two main variants: Absolute RS is the ratio of public to private benchmark performance (so a score of 1.41 means the model does 41% better on public tests), while Relative RS tracks how much a model’s ranking shifts once contamination is removed. A third metric, RS_II, detects “biased overtraining” – when a model excels in some domains but underperforms in others, suggesting focused training on specific benchmark types. As a quick reference:
– RS > 1: the model performs better on public tests (suggests contamination)
– RS ≈ 1: balanced performance (minimal contamination)
– RS < 1: the model performs better on fresh private tests (clean training)
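For readers who prefer code, here is a simplified sketch of the first two metrics as described above; the paper’s exact formulas may differ, and the helper names are mine.

```python
def absolute_rs(public_score: float, private_score: float) -> float:
    # Ratio of public-benchmark accuracy to fresh private-benchmark accuracy.
    # Values well above 1 suggest the model looks better on tests it may
    # have seen during training.
    return public_score / private_score

def relative_rs(public_rank: int, private_rank: int) -> int:
    # How far a model falls (or climbs) on the leaderboard once only
    # contamination-free private tests are counted.
    return private_rank - public_rank

# Example: 0.65 accuracy on public benchmarks vs. 0.46 on fresh private ones
print(round(absolute_rs(0.65, 0.46), 2))           # 1.41 -> ~41% higher on public tests
print(relative_rs(public_rank=3, private_rank=9))  # drops six places without contamination
```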

The Results: Contamination Varies Across Model Families
When the team tested more than 50 different AI models, the results revealed a complex picture across the AI landscape. Some models showed clear signs of contamination, while others actually performed better on fresh, unseen tests.
Specifically, from the paper’s findings:
– Phi family mixed results: Phi-1 showed contamination (Absolute RS of 1.21), Phi-2 actually performed better on private tests (0.62), while Phi-3-mini returned to problematic levels (1.27)
– Qwen family surprises: Most smaller Qwen models performed better on fresh tests (Qwen2-7B at 0.69, Qwen2.5-7B at 0.70), but Qwen2.5-72B showed the highest contamination score (1.41)
– Llama models generally clean: Most Llama variants showed balanced performance, with many scoring close to 1.0
This doesn’t necessarily mean the companies behind these models intentionally gamed the system. More likely, their massive training datasets inadvertently included benchmark questions that had circulated online, giving these models an unfair advantage they didn’t even know they had.
But here’s where it gets interesting: on the fresh private benchmarks, the paper calls Kimi-K2 “the best-performing open-source model,” one that “consistently achieves accuracy rates exceeding 40%, closely matching Gemini and Claude and even surpassing [them] in some tasks.” This suggests that some open-source systems can achieve competitive performance without relying on contamination advantages.

Why This Matters More Than You Think
This research exposes a real weakness in how we evaluate AI. When model companies and researchers tout their latest achievements on public benchmarks, this work suggests we should examine those claims more carefully: some apparent “improvements” in AI capability might reflect training data overlap rather than genuine advances.
The implications go beyond academic bragging rights. Companies make business decisions about which AI models to adopt based on benchmark scores. Investors fund startups based on claimed performance improvements. Researchers build their work on top of models they believe are more capable. If these scores are artificially inflated, we’re building on shaky foundations.
The Bigger Picture: An Arms Race of Honesty
ArxivRoll’s approach is clever, but it’s not a silver bullet. The six-month refresh cycle is both its strength and weakness. It provides enough time for proper evaluation while ensuring freshness, but determined bad actors could potentially game even this system.
What’s more encouraging is that the automatically generated SCP questions correlate strongly with human-voted leaderboards like ChatbotArena, with correlations reaching up to 0.76. This suggests that when contamination is removed, the automated tests still capture genuine model capabilities that humans recognize as valuable.
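As a quick illustration of what such a check looks like (hypothetical numbers, not the paper’s data), a rank correlation like Spearman’s can compare the automated SCP ranking with a human-voted leaderboard:

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same five models under the two evaluations
scp_accuracy = [0.42, 0.35, 0.38, 0.30, 0.22]   # automated SCP benchmark
arena_rating = [1250, 1230, 1180, 1150, 1100]   # human-voted leaderboard (Elo-style)

rho, p_value = spearmanr(scp_accuracy, arena_rating)
print(f"Spearman correlation: {rho:.2f}")  # high rho means the two rankings broadly agree
```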
The research also highlights an important measurement challenge: the AI field has been working with potentially skewed model comparisons. We’ve been ranking runners in a race where some participants may have had unintentional advantages.
Looking Forward: The Need for Benchmark Hygiene
ArxivRoll represents more than just a new evaluation method – it’s a call for better “benchmark hygiene” in the AI community. The framework could become a standard part of model evaluation, running alongside traditional benchmarks to provide contamination-free scores.
For AI practitioners, this research suggests we should be more skeptical of dramatic improvements on established benchmarks, especially when they come from models trained on increasingly large web-scale datasets. Some of the most impressive apparent gains may trace back to training data curation – and the benchmark material that slips into it – rather than to actual intelligence.
The democratizing effect is perhaps most interesting. By leveling the playing field and removing accidental advantages, systems like ArxivRoll might reveal that the gap between open and closed-source models is smaller than we thought. The strong performance of models like Kimi-K2 on fresh benchmarks suggests that open-source development can compete effectively when everyone plays by the same rules.
The Transparency Test
What makes ArxivRoll particularly compelling is its commitment to transparency. Each private test set is published once its evaluation period ends and is never reused, so results stay reproducible without giving future models a chance to train on live test questions. It’s like publishing exam questions after everyone has taken the test – maintaining accountability without compromising security.
This transparency will likely pressure model creators to clean up their training processes. No one wants to be the next model family identified with contamination scores exceeding healthy baselines.
The AI field is still young, and we’re learning how to evaluate these systems properly. ArxivRoll represents a maturation of our evaluation methods – a move from naive trust to sophisticated verification. As AI models become more powerful and consequential, having honest measures of their capabilities isn’t just nice to have; it’s essential.
The homework problem is real, but now we have tools to detect it. The question is: will the AI community embrace this kind of rigorous self-policing, or will it take regulatory pressure to ensure honest evaluation becomes the norm?
ArxivRoll source code and benchmarks are publicly available, continuing the researchers’ commitment to transparency in AI evaluation.