What Is Benchmark Gaming in AI? Why Self-Reported Scores Are Often Inflated
Kimi K2 reported 50% on HLE but independent testing found 29.4%. Learn how benchmark gaming works and how to evaluate AI models honestly.
The Gap Between What AI Companies Claim and What Independent Tests Find
When Moonshot AI released Kimi K2, they reported a 50% score on Humanity’s Last Exam — an impressive number for one of the hardest AI benchmarks in existence. Independent researchers ran their own tests shortly after. Their result: 29.4%. That’s a gap of more than 20 points on the same benchmark, supposedly measuring the same model.
This kind of discrepancy isn’t a fluke. It’s a symptom of a much wider problem in AI evaluation called benchmark gaming — and if you’re making decisions about which AI models to use, you need to understand how it works.
What Benchmark Gaming Actually Means
Benchmark gaming refers to any practice that inflates an AI model’s reported performance on standard tests without reflecting genuine capability improvements.
It’s the AI equivalent of teaching to the test. A student who memorizes past exam questions might score well without actually understanding the subject. AI models can do something similar — and companies have strong incentives to make their scores look as good as possible.
The term covers a range of behaviors, from outright data contamination (where a model is trained on benchmark questions) to more subtle practices like selective reporting, special prompting techniques, or cherry-picking evaluation conditions that favor the model.
Why It Matters
Benchmarks exist to help researchers and practitioners compare models objectively. They’re supposed to answer: “Which model is actually better at reasoning, coding, math, or following instructions?”
When scores are gamed, that signal breaks down. Developers building products, researchers deciding what to study, and businesses choosing AI vendors all rely on benchmark scores to some degree. Inflated numbers lead to bad decisions.
How Benchmark Gaming Works: The Main Techniques
There are several distinct ways a model’s benchmark score can end up higher than its real-world capability warrants.
Data Contamination
This is the most direct form of gaming. If a model’s training data contains the questions (or answers) from a benchmark, it can effectively “memorize” them. When evaluated, it performs well not because it can reason through problems, but because it’s seen the answers before.
This is surprisingly easy to do accidentally — the internet contains a lot of benchmark data — but it can also be done deliberately. Detecting contamination is difficult, since companies don’t always publish their full training datasets.
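One common heuristic in published contamination analyses is checking for verbatim n-gram overlap between benchmark questions and training text. Here is a minimal sketch of that idea; the function names and the n-gram length are illustrative choices, not a standard, and real analyses work over tokenized corpora at much larger scale:

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, a common unit for overlap checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_q: str, training_text: str, n: int = 13) -> bool:
    """Flag a benchmark question if any n-gram also appears verbatim
    in the training text. Long exact matches are unlikely by chance."""
    return bool(ngrams(benchmark_q, n) & ngrams(training_text, n))

# Toy illustration: a benchmark question copied verbatim into a training dump
q = "what is the capital of the country directly north of france"
train = ("trivia dump: what is the capital of the country "
         "directly north of france answer belgium")
print(looks_contaminated(q, train, n=8))  # True: an 8-gram is shared verbatim
```

Exact-match heuristics like this only catch verbatim copying; paraphrased or translated contamination slips through, which is part of why detection is so hard in practice.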
Selective Reporting
A company runs a model against 20 benchmarks. On 14 of them, performance is average. On 6, it looks exceptional. The press release highlights those 6.
This isn’t technically dishonest, but it creates a misleading picture. Selective reporting is one of the most common forms of benchmark gaming, and there’s rarely any obligation to publish all results.
Prompt Tuning and Special Conditions
Some benchmarks have standard evaluation protocols. Others don’t. A model developer might use a specific system prompt, chain-of-thought formatting, or other prompting strategy that significantly boosts scores — but that isn’t available to end users in normal deployment.
When an independent researcher replicates the test using a different (more realistic) setup, scores drop. This is likely part of what happened with Kimi K2’s HLE discrepancy.
Overfitting to Benchmark Style
Even without direct contamination, a model can be fine-tuned to perform well on the format and style of specific benchmarks. If you train on enough similar problems, you can score well without developing the underlying reasoning ability the benchmark was designed to measure.
This is particularly problematic with math benchmarks like MATH or GSM8K, where models have gotten suspiciously good in ways that don’t always transfer to novel problems.
Metric Cherry-Picking
Different evaluations use different metrics. Pass@1, pass@10, majority vote, and other aggregation methods can produce very different numbers from the same underlying model outputs. Choosing the most favorable metric — without clearly disclosing the choice — inflates apparent performance.
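To see how much the metric choice alone can move a headline number, here is the standard unbiased pass@k estimator popularized by the HumanEval paper, given n samples per problem with c of them correct. The sample counts below are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one
    of k randomly drawn samples (from n total, c correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Same model outputs, very different headline numbers:
# 10 samples per problem, 2 of them correct
print(pass_at_k(10, 2, 1))   # pass@1  = 0.2
print(pass_at_k(10, 2, 10))  # pass@10 = 1.0
```

A press release quoting pass@10 where competitors report pass@1 isn’t comparing like with like — which is exactly why undisclosed metric choices are a form of gaming.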
Real Examples of the Gap Between Claimed and Independent Results
The Kimi K2 case is recent and stark, but it’s not isolated.
Humanity’s Last Exam (HLE) has become a useful stress test precisely because it’s hard enough that scores can’t be easily faked through memorization. Even so, reported scores from model developers and independently measured scores have diverged in several cases.
Coding benchmarks like HumanEval and MBPP have seen similar issues. Models that score in the 80s or 90s on self-reported HumanEval often perform noticeably worse when researchers use modified problem sets that test the same skills with different syntax or structure.
MMLU contamination has been documented in multiple studies. Because MMLU questions were widely circulated on the internet, many models have almost certainly been trained on some portion of the test set — making scores on this benchmark particularly unreliable as a measure of genuine reasoning.
The pattern is consistent: self-reported numbers tend to be higher, and the gap tends to widen for models from companies with less rigorous evaluation practices or less external accountability.
Why the Leaderboard Culture Makes This Worse
There’s an arms race dynamic at work here. When benchmark scores become the primary way models are marketed and compared, every lab faces pressure to publish the highest possible numbers.
This is a direct instance of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. The better benchmarks get at distinguishing capable models, the more incentive there is to game them — and the less useful they become.
The AI community has tried to respond to this in a few ways:
- Regular benchmark refresh cycles — introducing new, harder, less-contaminated benchmarks (HLE is an example of this)
- Third-party evaluation organizations — groups like HELM at Stanford, or platforms like Chatbot Arena that use blind human preference ratings
- Reproducibility requirements — some conferences and publications now require model cards and detailed evaluation methodology
But these efforts are downstream of the incentive problem. As long as leaderboard position drives funding, partnerships, and press coverage, gaming will continue.
How to Evaluate AI Models More Honestly
If you can’t fully trust self-reported benchmarks, how should you actually compare AI models?
Use Blind, Human-Preference Evaluations
Chatbot Arena (originally run by LMSYS, now LMArena) is one of the better tools available. It presents users with two anonymous model outputs and asks which is better. Because users don’t know which model they’re rating, it’s harder to game. The Elo-style rankings it produces tend to correlate better with real-world performance than many standard benchmarks.
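The mechanics behind that kind of ranking are simple: each blind vote nudges two ratings toward each other. Here is a minimal sketch of a single online Elo update; the K-factor and starting ratings are illustrative, and Arena’s published leaderboard actually fits a Bradley–Terry model over all votes rather than updating online:

```python
def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One online Elo update from a single pairwise preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # win probability for A
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1000; model A wins one blind comparison
print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```

Because each rating shift depends on thousands of anonymous votes, a lab can’t easily inflate its position the way it can with a self-run benchmark.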
Test on Your Own Use Case
This is the most practical advice: build a small evaluation set based on what you actually need the model to do. If you’re using AI for customer support summarization, test 50–100 real examples. If you need code generation, test on your actual codebase structure.
Benchmarks measure general capability under controlled conditions. Your use case is specific.
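A task-specific eval like this can be a very small piece of code. Here is a minimal sketch, assuming a hypothetical `call_model` function for whatever provider you use and a list of (prompt, checker) pairs; the toy model and checks below are purely illustrative:

```python
from typing import Callable

def run_eval(call_model: Callable[[str], str],
             cases: list[tuple[str, Callable[[str], bool]]]) -> float:
    """Score a model on your own cases; each checker decides pass/fail."""
    passed = sum(1 for prompt, check in cases if check(call_model(prompt)))
    return passed / len(cases)

# Illustrative cases: crude string checks on summarization output
cases = [
    ("Summarize: refund issued for order 8841.", lambda out: "8841" in out),
    ("Summarize: customer asked to cancel plan.", lambda out: "cancel" in out.lower()),
]
toy_model = lambda prompt: prompt  # stand-in; swap in a real API call
print(run_eval(toy_model, cases))  # 1.0
```

Even 50 cases scored this way against your real inputs will tell you more about fit for your task than any leaderboard position.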
Look for Independent Replication
When a company publishes impressive benchmark numbers, wait to see if independent researchers replicate them. Groups like Epoch AI, EleutherAI, and various academic labs routinely reproduce evaluations. If nobody else can reproduce the score, that’s meaningful.
Check the Evaluation Methodology
Before trusting a number, ask: who ran the evaluation? What prompt format was used? Was it the model’s standard API, or a custom configuration? Were all results disclosed, or just selected ones? Companies with nothing to hide generally answer these questions clearly.
Be Skeptical of Suspiciously Clean Numbers
If a model claims 90%+ on a benchmark where state-of-the-art was 60% six months ago, and there’s no obvious explanation (new architecture, much more training data, etc.), that’s worth scrutinizing.
Where MindStudio Fits Into Model Selection
One practical challenge for teams building AI products is figuring out which model to actually deploy — not based on benchmark marketing, but based on real task performance.
MindStudio gives you access to 200+ AI models — Claude, GPT-4o, Gemini, Mistral, and many others — in a single no-code builder. The useful thing here isn’t just the breadth; it’s that you can run the same workflow against multiple models and directly compare outputs on your actual use case.
Instead of trusting a leaderboard, you can build a test workflow in MindStudio, feed it your real inputs, and see which model performs best on the specific task you’re trying to automate. That’s closer to a legitimate evaluation than any self-reported benchmark score.
This matters especially for specialized tasks — document extraction, tone-matched writing, structured data parsing — where general benchmarks are a poor proxy. You can try MindStudio free at mindstudio.ai and run your own comparisons without needing separate API keys or accounts for each model.
FAQ: Benchmark Gaming in AI
What is benchmark gaming in AI?
Benchmark gaming refers to practices that inflate an AI model’s performance on standard evaluation tests without reflecting genuine improvements in capability. This includes training on benchmark data (data contamination), selective reporting of favorable results, using special prompting conditions not available to normal users, and overfitting models to the style of specific tests.
Why did Kimi K2 show such a large gap between claimed and independent scores?
Moonshot AI reported Kimi K2 scoring 50% on Humanity’s Last Exam (HLE). Independent testing found 29.4%. The exact cause isn’t publicly confirmed, but likely explanations include differences in evaluation setup (prompting strategy, sampling parameters), the use of extended thinking or tool use in the official evaluation that wasn’t replicated independently, or genuine data contamination. This discrepancy highlights why self-reported scores should be treated with skepticism until independently verified.
Are all AI benchmarks unreliable?
Not entirely. Some benchmarks are more resistant to gaming than others. Humanity’s Last Exam was designed specifically to be hard enough to resist memorization. Chatbot Arena uses blind human preference ratings that are difficult to manipulate. LiveCodeBench rotates problems to reduce contamination risk. The issue isn’t that benchmarks are useless — it’s that self-reported scores on standard benchmarks require scrutiny.
What is data contamination in AI evaluation?
Data contamination happens when a model’s training data includes questions, answers, or closely related material from benchmark test sets. Because models can effectively memorize content they’ve seen repeatedly, a contaminated model can score well on a benchmark without actually having the reasoning ability the test is designed to measure. Contamination is hard to detect when training datasets aren’t published.
How can I tell if an AI model’s benchmark scores are trustworthy?
Look for third-party replication of the results. Check if the methodology is clearly disclosed — including prompt format, sampling settings, and whether all results are reported or just selected ones. Compare against blind evaluation platforms like Chatbot Arena. And, most practically, test the model on your own tasks rather than relying on general benchmark claims.
What’s the difference between self-reported and independent benchmark results?
Self-reported results are published by the company that built the model, often under evaluation conditions they control and design. Independent results come from third-party researchers running the same benchmarks under standardized conditions. Self-reported scores are systematically higher — not always because of deliberate manipulation, but because companies optimize their evaluation setup in ways that favor their models and then select which results to publish.
Key Takeaways
- Benchmark gaming covers a range of practices — data contamination, selective reporting, special prompting, metric cherry-picking — that inflate AI model scores without improving real capability.
- The Kimi K2 case (50% claimed vs. 29.4% independently measured on HLE) is a recent, well-documented example of this gap.
- Goodhart’s Law explains why the problem persists: as benchmarks become targets, they become worse measures.
- More reliable evaluation approaches include blind human-preference platforms (Chatbot Arena), third-party replication, and testing models directly on your own use case.
- When selecting AI models for real applications, self-reported leaderboard scores should be one data point — not the deciding factor.
If you’re building AI-powered workflows and need to choose between models honestly, MindStudio’s multi-model builder lets you run real comparisons on your actual tasks. Try it free at mindstudio.ai.