What Is the Humanity's Last Exam Benchmark? How Independent Testing Revealed a 21-Point Score Inflation
Kimi K2 self-reported 50% on HLE. Independent testing found 29.4%. Here's how the HLE benchmark works and why third-party verification matters.
A 21-Point Gap Is Hard to Ignore
When Moonshot AI released Kimi K2, their benchmark card listed a score of roughly 50% on Humanity's Last Exam. That number, if accurate, would have put the model among the top performers on one of the hardest AI evaluations in existence.
Independent testers ran it themselves. They got 29.4%.
That’s a 21-point gap — not a rounding error or a methodology quirk, but a fundamental disagreement about how capable the model actually is. This kind of discrepancy is exactly why the AI community keeps returning to one uncomfortable question: can we trust the numbers that labs publish about their own models?
To understand what went wrong, you first need to understand the Humanity's Last Exam benchmark — what it is, how it's scored, and why it's become one of the most important (and most contested) evaluation tools in AI today.
What Humanity's Last Exam Actually Is
Humanity's Last Exam, commonly abbreviated as HLE, is a multiple-choice and short-answer benchmark developed by the Center for AI Safety and Scale AI. It was publicly released in early 2025 and immediately drew attention because of its stated purpose: to be a benchmark so difficult that frontier AI models couldn't pass it easily — at least not yet.
The name signals the benchmark's ambition: its creators positioned it as the last closed-ended academic exam that models should ever need. Its most direct predecessor is MMLU (Massive Multitask Language Understanding), which once tested the limits of language models but has since become too easy for top-tier systems. HLE was designed to replace it as the benchmark that actually separates capable models from exceptional ones.
It covers an unusually broad range of domains, including:
- Mathematics and formal logic
- Physics, chemistry, and biology
- History, philosophy, and linguistics
- Law, economics, and social sciences
- Literature, art history, and classical studies
But HLE doesn't just ask hard questions in these fields — it asks questions that were specifically designed to resist pattern-matching and memorization. Many questions were sourced directly from domain experts, including PhD students, professors, and researchers, who were asked to write problems that even a broadly knowledgeable person would fail without deep expertise in the specific field.
Why HLE Is Considered Unusually Difficult
Most benchmarks test whether a model can recall and apply commonly known information. HLE goes further. Many of its questions require:
- Synthesizing information across multiple specialized domains
- Drawing on niche knowledge that doesn’t appear frequently in training data
- Reasoning through multi-step problems with no obvious shortcut
- Answering in ways that can’t be easily guessed from the structure of the question
When HLE was first released, frontier models of the time, including GPT-4-class systems, scored in the single digits. As of mid-2025, the leading models score somewhere between 30% and 55% on the benchmark, depending on who's testing and how.
That last part — “depending on who’s testing” — is the crux of the Kimi K2 situation.
How HLE Scoring Works
Before getting into the discrepancy, it helps to understand how the benchmark is actually administered and scored.
The Basic Setup
HLE questions fall into two formats: multiple-choice and exact-match short answer. Multiple-choice questions are scored as correct or incorrect based on which option the model selects. Short-answer questions require the model to produce a specific answer — often a number, a proper noun, or a precise technical term — that is matched against a known correct answer.
The score reported is simply the percentage of questions answered correctly out of the total.
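In code, that scoring logic reduces to grading each item and taking the fraction correct. The snippet below is a minimal illustration only, not the official HLE grading harness, which may normalize or judge answers more carefully:

```python
# Minimal sketch of percentage-correct scoring for a mixed
# multiple-choice / exact-match benchmark. Illustrative only.

def grade(question, model_answer):
    if question["format"] == "multiple_choice":
        # Correct if the model picked the right option letter.
        return model_answer.strip().upper() == question["answer"].strip().upper()
    # Exact-match short answer: normalize whitespace and case before comparing.
    return model_answer.strip().lower() == question["answer"].strip().lower()

def score(questions, model_answers):
    correct = sum(grade(q, a) for q, a in zip(questions, model_answers))
    return 100.0 * correct / len(questions)

# Example: two questions, one answered correctly -> 50.0.
# Note how strict exact matching treats a near-miss as wrong.
questions = [
    {"format": "multiple_choice", "answer": "B"},
    {"format": "short_answer", "answer": "Riemann zeta function"},
]
print(score(questions, ["B", "the zeta function"]))  # 50.0
```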
Where Variation Creeps In
Even with a standardized benchmark, there are several places where evaluation choices can change results significantly:
Prompt formatting. How the question is presented to the model matters. Does it include examples? Is the model told to think step-by-step before answering? Does it get system-level instructions that prime it to be careful or conservative? Different prompt templates can produce meaningfully different scores.
Temperature and sampling settings. Higher temperature settings produce more varied outputs. When evaluating accuracy, lower temperatures (closer to deterministic output) typically give more reliable results — but not every evaluator uses the same settings.
Pass@1 vs. majority voting. Some evaluators report what the model gets on a single attempt (pass@1). Others give the model several attempts and count a question as solved if any attempt succeeds (pass@k), or sample several answers and grade only the most common one (majority voting). These numbers can diverge substantially on hard benchmarks.
Context window usage. For long or complex questions, how much context the model is allowed to use before generating an answer can affect performance.
Who administers the test. A model developer running their own evaluation has obvious incentives — conscious or not — to choose the setup that yields the best result. An independent evaluator has no such incentive.
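To make the aggregation point concrete, the sketch below (hypothetical answers, not real Kimi K2 outputs) shows how the same per-question attempts can produce very different headline numbers depending on whether you report pass@1, best-of-k, or a majority vote:

```python
from collections import Counter

# Hypothetical per-question attempts: each record holds the answers a model
# gave across three independent samples, plus the reference answer.
runs = [
    {"correct": "A", "attempts": ["A", "C", "A"]},
    {"correct": "7", "attempts": ["3", "7", "7"]},
    {"correct": "B", "attempts": ["D", "D", "B"]},
    {"correct": "X", "attempts": ["Y", "Z", "Y"]},
]

def pass_at_1(runs):
    # Grade only the first attempt, as a standardized evaluator typically would.
    return sum(r["attempts"][0] == r["correct"] for r in runs) / len(runs)

def best_of_k(runs):
    # Count a question as solved if *any* attempt was right (flattering).
    return sum(r["correct"] in r["attempts"] for r in runs) / len(runs)

def majority_vote(runs):
    # Take the most common answer across attempts, then grade that one.
    return sum(Counter(r["attempts"]).most_common(1)[0][0] == r["correct"]
               for r in runs) / len(runs)

print(pass_at_1(runs))      # 0.25
print(best_of_k(runs))      # 0.75
print(majority_vote(runs))  # 0.5
```

Same model, same outputs, and the "score" ranges from 25% to 75% depending purely on how the attempts are aggregated.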
The Kimi K2 Case in Detail
Kimi K2 is a mixture-of-experts model released by Moonshot AI in mid-2025. When it launched, its reported HLE score of approximately 50% was widely circulated in AI coverage. That number, if accurate, would place it ahead of most competitors on one of the hardest public benchmarks available.
Several independent researchers decided to verify the claim.
What Independent Testing Found
Third-party evaluators, running the benchmark under standard conditions with consistent prompt formatting, found Kimi K2 scoring approximately 29.4% on HLE — a gap of more than 20 percentage points from the developer-reported figure.
This wasn't a minor discrepancy attributable to natural variance. On a benchmark with thousands of questions (HLE's public set contains roughly 2,500), a gap this large is far outside sampling noise. It points to a systematic difference in evaluation methodology, not random variation.
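A quick back-of-envelope calculation shows why. Assuming, for illustration, a full run over roughly 2,500 questions (about the size of HLE's public set), the sampling uncertainty on a single accuracy estimate is around one to two percentage points, nowhere near 20:

```python
import math

# Rough binomial standard error for an accuracy estimate on n questions.
# Illustrative assumption: n = 2500, approximately the size of HLE's public set.
n = 2500
for p in (0.294, 0.50):
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    print(f"accuracy {p:.1%}: ±{1.96 * se:.1%} at ~95% confidence")

# Output (approx.):
# accuracy 29.4%: ±1.8% at ~95% confidence
# accuracy 50.0%: ±2.0% at ~95% confidence
# A gap of more than 20 points is many standard errors wide: not sampling noise.
```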
The Most Likely Explanations
Several plausible explanations have been discussed in the AI evaluation community:
Optimized prompting. Developers often test with prompts specifically tuned to help their model perform well on known benchmarks. This can include system prompts that activate particular reasoning modes or instruct the model to double-check its answers. Independent testers use simpler, standardized prompts.
Selective reporting. Some labs run evaluations multiple times and report the best result. Or they may test on a subset of questions where the model performs well and extrapolate from that.
Extended inference settings. Some scores are generated using “test-time compute” — giving the model more time or more attempts to reason through each answer. This can substantially boost scores but isn’t representative of normal inference conditions.
Data contamination. If a model has seen HLE questions (or very similar questions) during training, it will perform better than it would on genuinely novel problems. This is notoriously difficult to verify from the outside.
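Contamination is the hardest of these to rule out from the outside, but crude screens exist. The sketch below shows a hypothetical n-gram overlap check of the kind researchers sometimes use to flag benchmark questions that may have leaked into a training corpus; the function names and threshold are illustrative, not a standard tool:

```python
def ngrams(text, n=8):
    # Split text into lowercase word-level n-grams.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_flag(question_text, corpus_texts, n=8, threshold=0.3):
    # Flag a benchmark question if a large fraction of its n-grams
    # also appear verbatim in any sampled training document.
    q_grams = ngrams(question_text, n)
    if not q_grams:
        return False
    return any(
        len(q_grams & ngrams(doc, n)) / len(q_grams) >= threshold
        for doc in corpus_texts
    )

corpus_sample = ["documents sampled from the corpus you want to screen go here"]
print(overlap_flag("full text of a benchmark question goes here", corpus_sample))  # False
```

A check like this only catches verbatim leakage; paraphrased or translated contamination slips through, which is part of why outside observers can rarely settle the question definitively.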
None of these explanations require bad faith on the part of the developer. Benchmark inflation often happens through a series of reasonable-seeming choices, each of which favors performance, that compound into a misleading final number.
Why Benchmark Inflation Is a Systemic Problem
The Kimi K2 case isn’t unique. AI benchmark scores have been inflating relative to real-world capability for years, across many labs and many evaluations.
There are structural reasons for this.
The Incentives Are Misaligned
Publishing strong benchmark numbers drives media coverage, attracts enterprise customers, and helps with fundraising. The incentive to report the best possible number is enormous. There’s no equivalent incentive to report the number that’s most representative of typical performance.
Benchmarks Become Targets
Once a benchmark is widely adopted, model development starts to optimize for it — sometimes explicitly, sometimes just because training data happens to overlap with benchmark content. The result is that a model can score well on a benchmark without actually being more capable in the way the benchmark was designed to measure.
This is a version of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.
HLE Was Designed to Resist This
Part of what makes HLE valuable is that it was specifically designed to be harder to game. The questions are obscure enough that they’re unlikely to appear frequently in training data. The domains are broad enough that narrow optimization is difficult. And the questions are hard enough that getting lucky answers by pattern-matching is unlikely.
But as the Kimi K2 case shows, even a carefully designed benchmark can’t fully solve the evaluation incentive problem. What it can do is make independent verification more meaningful — because a model that genuinely scores well on HLE has actually demonstrated something.
Why Third-Party Verification Matters More Than the Number
The real story here isn’t that Kimi K2 underperformed. A 29.4% score on HLE is still a significant result — many capable models score lower. The problem is the gap between what was claimed and what was independently measured.
That gap matters for several practical reasons.
Enterprises Are Making Real Decisions Based on These Numbers
Companies choosing which LLM to integrate into their products — for customer service, document processing, code generation, legal analysis — often use benchmark scores as a primary signal. A model marketed at 50% on HLE that actually delivers 29.4% will underperform on complex tasks, and that underperformance tends to surface in production, where it is most expensive to discover.
Independent Testing Provides a Consistent Baseline
When independent evaluators run the same benchmark with the same methodology across multiple models, the results become meaningful for comparison even if the absolute scores shift. If one model consistently scores 35% in independent testing and another scores 30% under the same conditions, that difference tells you something real. A developer-reported 50% vs. another developer-reported 45% tells you almost nothing.
The Community Can Catch What Labs Miss
The open nature of AI evaluation — where researchers, developers, and users run their own tests and publish their findings — acts as a correction mechanism. The Kimi K2 discrepancy was identified relatively quickly because people cared enough to check. That kind of distributed verification is one of the more functional parts of the current AI ecosystem.
Evaluating AI Models for Real-World Use
Understanding HLE score inflation has a practical implication: when choosing a model for actual work, benchmark claims should be treated as a starting point, not a conclusion.
What to Look For Instead
When evaluating which model to use for a specific application, more informative signals include:
- Task-specific performance — How does the model actually perform on the type of task you care about? Running a small, focused evaluation on your own use case is more useful than any published benchmark number (a minimal sketch of what that can look like follows this list).
- Independent or community-generated scores — Results from organizations with no stake in the outcome are more trustworthy than self-reported ones.
- Consistency across attempts — A model that occasionally produces brilliant results but frequently fails is less useful than one that produces consistently good results.
- Cost-performance ratio — A model that scores slightly lower on benchmarks but costs significantly less per token may be the better practical choice.
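As a starting point, a task-specific evaluation can be very small. The sketch below assumes a hypothetical call_model helper standing in for whatever API or platform you use, plus a handful of labeled examples drawn from your own workload:

```python
# Minimal task-specific eval harness: run the same labeled examples through
# several candidate models and compare accuracy directly.

def call_model(model_name: str, prompt: str) -> str:
    # Placeholder: replace with a real call to your model provider or platform.
    return "negative"

examples = [
    {"prompt": "Classify the sentiment: 'The refund took six weeks.'", "expected": "negative"},
    {"prompt": "Classify the sentiment: 'Setup took five minutes, flawless.'", "expected": "positive"},
    # ...add 30-50 real examples drawn from your own workload
]

def evaluate(model_name: str) -> float:
    correct = 0
    for ex in examples:
        answer = call_model(model_name, ex["prompt"]).strip().lower()
        correct += ex["expected"] in answer
    return correct / len(examples)

for model in ["model-a", "model-b", "model-c"]:  # your candidate models
    print(model, f"{evaluate(model):.0%}")
```

Even a few dozen examples run consistently across candidates will tell you more about fit for your task than any developer-reported benchmark number.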
How MindStudio Simplifies Model Selection
This is where MindStudio becomes relevant. Because model selection matters so much, and because no single model is best for every task, MindStudio gives you access to 200+ AI models from a single platform — no separate API keys, no account juggling, no infrastructure setup.
That breadth matters in the context of benchmark inflation. When you’re not locked into a single model, you can actually test multiple models on your specific task and see which one performs best in practice. You’re not choosing based on a developer’s reported HLE score; you’re choosing based on results you observed yourself.
If Kimi K2 performs well on your actual use case, great — use it. If it underperforms relative to its claimed scores, you can switch to Claude, GPT-4o, Gemini, or any of the other models available without changing your workflow or rebuilding your agent.
MindStudio’s visual builder also means you can set up that kind of side-by-side comparison quickly — often in under an hour — and see real output differences across models on your actual prompts and data. You can explore how to build your first AI agent on MindStudio for free at mindstudio.ai.
The Broader State of AI Evaluation in 2025
The HLE benchmark and the Kimi K2 case are part of a larger conversation about how to evaluate AI systems honestly.
Evaluations Are Getting Harder to Run Well
As models get more capable and more expensive to run, thorough evaluation requires significant compute. Not every independent researcher can afford to run a full benchmark evaluation at the scale needed to get statistically reliable results. This asymmetry — labs can run expensive evaluations, independent researchers often can’t — tends to favor developer-reported numbers by default.
Some Organizations Are Trying to Fix This
Organizations like Epoch AI, EleutherAI, and various university research groups are working to establish more rigorous, independent evaluation infrastructure. The AI safety community has pushed for standardized evaluation protocols. Efforts like the Chatbot Arena (which uses human preference ratings rather than fixed benchmark questions) offer a different angle on model quality.
None of these fully solve the problem. But they’re creating more reference points against which suspicious scores can be checked.
What Hasn’t Changed
The fundamental challenge is that measuring intelligence — artificial or otherwise — is genuinely hard. A benchmark score is a proxy, and every proxy can be gamed, optimized, or accidentally inflated. HLE is better than most benchmarks precisely because it’s harder to game, but “harder to game” isn’t the same as “impossible to inflate.”
The most honest framing is this: no single benchmark score tells you whether a model is good. It tells you how a specific model performed on a specific set of questions under specific conditions at a specific point in time. The Kimi K2 situation is a clear example of what happens when that context gets stripped away and a number gets treated as a fact.
Frequently Asked Questions
What is the Humanity's Last Exam (HLE) benchmark?
HLE is a multi-domain AI benchmark developed by the Center for AI Safety and Scale AI. It was designed to be significantly harder than older benchmarks like MMLU, with questions sourced from domain experts across fields including mathematics, science, history, philosophy, and law. A high score on HLE is considered a meaningful signal of advanced reasoning ability. For context on how AI benchmarks have evolved, see Scale AI’s evaluation research.
Why did Kimi K2 show a 21-point gap between self-reported and independent scores?
The gap between Kimi K2’s self-reported HLE score (~50%) and independently tested score (~29.4%) likely reflects differences in evaluation methodology. Common sources of inflation include optimized prompting, test-time compute settings that go beyond standard inference, selective reporting of best results, or potential data contamination. This doesn’t necessarily indicate deliberate manipulation — these issues often arise from well-intentioned choices that compound into a misleading number.
How is HLE scored?
HLE questions are either multiple-choice or short-answer format. Multiple-choice questions are scored as correct or incorrect. Short-answer questions require an exact or very close match to a known correct answer. The final score is the percentage of all questions answered correctly. Scores on HLE tend to be lower than on most other benchmarks because the questions are designed to be genuinely difficult and resistant to pattern-matching.
What’s the difference between self-reported and independent benchmark scores?
Self-reported scores are published by the organization that built the model. They may use evaluation setups optimized to produce the best possible result. Independent scores are produced by third parties with no stake in the outcome, typically using standardized prompts and settings. Independent scores are generally considered more reliable for cross-model comparison, even if they produce lower absolute numbers.
How should I use benchmark scores when choosing an AI model?
Treat them as one signal among many, not a definitive answer. Look for independent scores in addition to developer-reported ones. Run your own small evaluation on the specific task you care about — real-world performance on your use case matters more than any general benchmark. Also consider inference cost, latency, context window size, and how consistently the model performs, not just its peak score.
Is HLE the best benchmark for evaluating AI models in 2025?
HLE is among the most respected benchmarks for measuring high-end reasoning ability, but no single benchmark is best for all purposes. HLE correlates well with performance on hard analytical tasks, but it’s less informative for evaluating code generation, conversational quality, or creative writing. The most useful approach is to use multiple benchmarks alongside task-specific testing. Platforms that let you test multiple models directly — like MindStudio’s multi-model agent builder — can make that kind of real-world comparison faster and more practical.
Key Takeaways
- Humanity's Last Exam is one of the most rigorous AI benchmarks available, designed specifically to resist memorization and pattern-matching, covering dozens of academic and professional domains.
- Kimi K2’s self-reported HLE score of ~50% was found to be ~29.4% in independent testing — a 21-point gap that highlights how evaluation methodology can dramatically inflate reported numbers.
- Score inflation typically isn’t deliberate fraud. It often results from optimized prompting, extended inference settings, selective reporting, or data contamination — each individually defensible, collectively misleading.
- Independent verification matters because companies and developers make real model-selection decisions based on benchmark numbers.
- For practical use, the most reliable approach is to evaluate models yourself on your actual task — and to use platforms that make multi-model comparison easy rather than committing to a single model based on its marketing materials.
- MindStudio’s access to 200+ models from one platform makes it straightforward to run your own model comparisons without separate accounts, API keys, or infrastructure overhead. Try it free at mindstudio.ai.