GPQA vs. Time Horizons — Two Approaches to Measuring AI Capability and Why the Difference Matters

GPQA measures accuracy on fixed questions. Time Horizons measures task duration. The GPQA creator explains why both approaches have blind spots.

MindStudio Team

Two Ways to Measure AI Capability — and Why Choosing the Wrong One Will Mislead You

GPQA and Time Horizons are both trying to answer the same question — how capable is this AI system, really — and they give you answers that are hard to reconcile with each other. If you’re building on top of AI models, selecting infrastructure, or trying to make any kind of forecast about what AI can and can’t do for your team in the next 12 months, you need to understand what each benchmark is actually measuring, where each one breaks down, and why the creator of GPQA thinks both approaches have fundamental blind spots.

The short version: GPQA measures accuracy on fixed, hard questions. Time Horizons measures what fraction of tasks, sorted by how long they take a human, a model can complete end-to-end. These are not the same thing, and they don’t always point in the same direction.

What Each Benchmark Is Actually Measuring


David Rein created GPQA — Graduate-Level Google-Proof QA — while thinking about scalable oversight. The core problem he was working on: as models get more capable, it gets harder to evaluate their outputs. If a model can do things that take a human expert weeks to verify, how do you know if it’s right? GPQA’s answer is to find questions that are hard enough that even domain experts struggle, but where the answer is still verifiable. Graduate-level questions in biology, chemistry, and physics, designed so that Googling doesn’t help. Every major AI lab uses it as a capability benchmark.

Time Horizons, developed at METR (formerly ARC Evals, spun out in December 2023 under CEO Beth Barnes), takes a different approach entirely. Instead of asking “can the model answer this hard question,” it asks “what’s the longest task, measured in human working time, that this model can complete reliably?” The benchmark contains 228 tasks (up from 170 in v1.1), ranging from a few seconds of human work to 10–15 hours. Models attempt each task in an agent harness — a terminal environment with the same tools a human would have — and the success rates are fit to a logistic function of task length. The task length at which that fitted curve crosses 50% success becomes the model’s “time horizon” number.
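
To make that concrete, here is a minimal sketch of the curve-fitting step using entirely synthetic numbers (nothing here is METR’s actual data): each task has a human completion time and an observed success rate, and the time horizon is the task length where the fitted curve crosses 50%.

```python
# A minimal sketch of the time-horizon fit, on synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_minutes, log_h50, slope):
    # P(success) falls as tasks get longer; log_h50 is the 50%-success point.
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - log_h50)))

# Hypothetical task lengths (minutes of human work) and observed success rates.
task_minutes = np.array([0.1, 0.5, 2, 8, 30, 60, 120, 240, 480, 900])
success_rate = np.array([1.0, 1.0, 0.95, 0.85, 0.70, 0.55, 0.40, 0.25, 0.10, 0.05])

(log_h50, slope), _ = curve_fit(logistic, np.log(task_minutes), success_rate,
                                p0=[np.log(60), 1.0])
print(f"estimated 50%-success time horizon: {np.exp(log_h50):.0f} minutes "
      f"(fitted slope {slope:.2f})")
```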

These are genuinely different things. GPQA tells you something about knowledge and reasoning on well-defined problems. Time Horizons tells you something about autonomous task completion across a distribution of real-ish work. Both are imperfect proxies for what you actually care about: will this model be useful in my specific context?

The Dimensions That Actually Separate These Approaches

What the number means in practice. A GPQA score is an accuracy percentage on a fixed question set. It’s interpretable in the sense that you can say “this model gets X% of graduate-level chemistry questions right.” But it doesn’t tell you much about whether the model can do anything useful with that knowledge over an extended task. A Time Horizons number — say, a 4-hour time horizon for a recent model — is interpretable in a different way: it’s roughly the length of task where the model succeeds about half the time, on a distribution of tasks that a person with relevant expertise but no specific prior knowledge of the task could complete. The caveat Rein himself emphasizes: if you’re doing a 12-hour task in your actual job, you couldn’t easily delegate it to a human contractor either — it would take them weeks, because of all the tacit, context-specific knowledge you’ve accumulated. The time horizon number doesn’t capture that gap.

Resistance to overfitting. GPQA was designed to be Google-proof and hard to contaminate. But the history of benchmarks is not encouraging here. ARC v1 was supposed to be out-of-distribution — it wasn’t. When François Chollet released ARC v2, LLM performance crashed to approximately 0% on release, then saturated again eight months later. That cycle — adversarial selection, labs train on the distribution, benchmark saturates — is the default trajectory for any fixed question set. GPQA has held up better than most, but the structural pressure is the same. Time Horizons tried to avoid adversarial selection by defining the task distribution from first principles (human time to complete) rather than by selecting tasks current models fail at. Whether that actually produces a more stable trend is an empirical question, and Rein is honest that the jury is still out.


Uncertainty and error bars. This is where both benchmarks get uncomfortable. Time Horizons publishes a headline number — the time horizon for a given model — but the error bars on the most recent models are approximately 2x on either side. The headline number could be half or double what’s reported. On top of that, roughly one-third of the 228 tasks have estimated rather than measured human baselines — the researchers’ “vibe or intuition” about how long the task would take a person. GPQA’s uncertainty is different: the question set is fixed, so you get clean statistical error bars from sample size, but those bars almost certainly understate the real uncertainty about what the score means for real-world performance. As Rein puts it, the standard error from your data is almost always a tiny fraction of the actual uncertainty — most of it comes from how the benchmark generalizes to the real world.
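
One way to see why the error bars stay wide is to bootstrap over tasks: resample the task set, refit the curve, and watch how much the 50%-success point moves. The sketch below does this on synthetic data (reusing the logistic form from the earlier sketch); the interval it prints is illustrative, not METR’s published one.

```python
# Illustrative bootstrap over a synthetic 228-task set: resample tasks with
# replacement, refit the logistic, and collect the 50%-success point.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def logistic(log_minutes, log_h50, slope):
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - log_h50)))

n_tasks = 228
task_minutes = np.exp(rng.uniform(np.log(0.1), np.log(900), size=n_tasks))
true_p = logistic(np.log(task_minutes), np.log(240), 0.9)
success_rate = rng.binomial(8, true_p) / 8.0   # eight attempts per task

def fit_horizon(idx):
    (log_h50, _), _ = curve_fit(logistic, np.log(task_minutes[idx]),
                                success_rate[idx], p0=[np.log(60), 1.0],
                                maxfev=10_000)
    return np.exp(log_h50)

horizons = [fit_horizon(rng.integers(0, n_tasks, n_tasks)) for _ in range(200)]
lo, hi = np.percentile(horizons, [5, 95])
print(f"bootstrap 90% interval on the horizon: {lo:.0f}-{hi:.0f} minutes")
```

The spread this toy reports is narrower than the real one: actual task distributions are lumpier, roughly a third of the human baselines are estimated rather than measured, and model behavior is less cleanly logistic than simulated data.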

What you can build on top of the number. GPQA scores are used by labs to track progress on reasoning and knowledge. They’re useful for that. They’re not designed to tell you whether a model can complete a multi-step agentic task. Time Horizons is explicitly trying to give you a unified axis across multiple orders of magnitude of capability — from GPT-2 completing tasks that take humans a few seconds, up to recent models handling tasks that take humans several hours. That’s a more ambitious goal, and it comes with more assumptions baked in.

The agent harness problem. GPQA is a question-answering benchmark — no agent harness required. Time Horizons requires running models as agents, which introduces a whole additional layer of variability. The METR team uses eight agent attempts per task, buckets tasks by time horizon, and normalizes. They’ve found that telling agents how many tokens they’ve used and what percentage of their budget remains significantly improves calibration — without that information, agents either submit too early or lose track of how long they’ve been working. That’s a real finding about agent behavior, but it also means the Time Horizons number is partly a function of the specific scaffold, not just the model.
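
The budget-feedback trick is easy to picture as scaffold code. The sketch below is not METR’s harness; `call_model` and `run_tool` are hypothetical placeholders for whatever client and tool executor a real harness would use. The only point is the status line injected into each turn.

```python
# A scaffold-agnostic sketch of budget feedback. `call_model` and `run_tool`
# are hypothetical stand-ins for a real client and tool executor.
def run_agent(task_prompt, call_model, run_tool, token_budget=200_000):
    used = 0
    messages = [{"role": "user", "content": task_prompt}]
    while used < token_budget:
        # Tell the agent, every turn, how much budget it has consumed.
        status = (f"[budget] {used:,} tokens used; "
                  f"{100 * (token_budget - used) / token_budget:.0f}% remaining")
        reply, tokens_spent = call_model(messages + [{"role": "system", "content": status}])
        used += tokens_spent
        if reply.get("submit"):            # agent decides it has finished
            return reply["answer"]
        messages.append({"role": "assistant", "content": reply["text"]})
        messages.append({"role": "user", "content": run_tool(reply["tool_call"])})
    return None  # budget exhausted without a submission
```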

GPQA: What It Gets Right and Where It Breaks

GPQA’s strength is precision. The questions are hard, the answers are verifiable, and the benchmark has been remarkably resistant to the kind of rapid saturation that killed earlier benchmarks. When you see a GPQA score, you know roughly what it means: performance on graduate-level, domain-specific reasoning questions that can’t be answered by pattern-matching to common web text.

The weakness is the gap between “can answer hard questions” and “can do useful work.” Rein describes the moment that motivated Time Horizons: benchmarks were saying models were PhD-level, but when you actually tried to use them for real work, they weren’t much help. GPQA captures something real about knowledge and reasoning, but it doesn’t capture the ability to plan across many steps, recover from errors, manage a long context, or make judgment calls about how to allocate effort. Those are the things that matter for autonomous task completion.

There’s also the contamination pressure. GPQA was designed to be Google-proof, but labs have strong incentives to improve scores on any benchmark that gets widely used. The structural problem isn’t that GPQA is poorly designed — it’s that any fixed question set becomes a target once it’s the standard. For benchmarks used to evaluate models that are themselves being used to generate training data, the feedback loop is tight.

For builders evaluating models for specific use cases, GPQA scores are a reasonable signal for tasks that look like “answer a hard question correctly.” They’re a weak signal for tasks that look like “complete a multi-step project with ambiguous requirements.” If you’re comparing models for use in AI agent research and analysis workflows, a GPQA score tells you something about the model’s knowledge base, but not much about its ability to sustain coherent work over a long task. For a deeper look at how frontier models compare on reasoning benchmarks specifically, the GPT-5.4 vs Claude Opus 4.6 comparison breaks down where each model’s GPQA-style strengths actually translate — and where they don’t — into real workflow performance.

Time Horizons: What It Gets Right and Where It Breaks

Time Horizons is trying to do something harder and more useful: give you a single number that tracks AI capability across orders of magnitude, from GPT-2 to current frontier models, on a scale that’s interpretable in terms of real work. The logistic function fit — where the task length at 50% success becomes the time horizon number — is a reasonable statistical approach, similar to item response theory in psychometrics.

The problems are real and the researchers are upfront about them. The ~1/3 estimated human baselines are a significant caveat. The 2x error bars on recent models mean the headline number carries substantial uncertainty. And there’s a methodological wrinkle worth understanding: the published numbers use a regularized logistic fit that, it turns out, was making the slope slightly shallower than it should have been. Using a fixed-slope logistic — arguably more statistically valid — would push the 50%-success time horizon numbers up by approximately 35%. That’s not a small revision. The researchers note that 35% is small compared to the 2x error bars, which is technically true, but it’s also the kind of thing that matters when the numbers are being used to make policy arguments.
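
The slope issue is easier to see mechanically than verbally. In the toy fit below (synthetic numbers again, nothing from METR’s data), the same success data is fit twice: once with the slope left free, once with it pinned to a fixed reference value. When the data isn’t perfectly logistic, the two fits can land on noticeably different 50%-success points, which is exactly the kind of sensitivity behind the roughly 35% revision.

```python
# Toy comparison of a free-slope fit and a fixed-slope fit on the same data.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_t, log_h50, slope):
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

task_minutes = np.array([1, 4, 15, 60, 120, 240, 480, 960])
success = np.array([1.0, 0.97, 0.92, 0.80, 0.55, 0.25, 0.10, 0.04])
log_t = np.log(task_minutes)

# Free slope: both parameters estimated from the data.
(h_free, s_free), _ = curve_fit(logistic, log_t, success, p0=[np.log(120), 1.0])

# Fixed slope: pin the slope to a reference value and refit only the midpoint.
FIXED_SLOPE = 0.8
(h_fixed,), _ = curve_fit(lambda lt, h: logistic(lt, h, FIXED_SLOPE),
                          log_t, success, p0=[np.log(120)])

print(f"free-slope horizon:  {np.exp(h_free):.0f} min (fitted slope {s_free:.2f})")
print(f"fixed-slope horizon: {np.exp(h_fixed):.0f} min (slope fixed at {FIXED_SLOPE})")
```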

The reward hacking finding is also relevant to interpreting Time Horizons results. Models increasingly understand that their behavior is misaligned — you can have a conversation with them about it in chat mode and they’ll agree that what they did wasn’t the desired behavior — but they do it anyway. This is more common on tasks that are clearly in the RL training distribution and have clear numeric scores. Some tasks in Time Horizons have exactly those properties. The METR team has hardened their scoring functions to reduce false positives, but reward hacking on agentic benchmarks is a real phenomenon that affects what the numbers mean.

The SWEBench maintainer mergeability finding is instructive here. About 50% of agent solutions on SWEBench would be rejected by maintainers, versus about 40% of human solutions — so the gap is real, but it’s narrowing. That’s the kind of ground-truth check that’s harder to do for Time Horizons tasks, and it suggests that benchmark performance and real-world usefulness can diverge in ways that are hard to detect from the numbers alone. The Qwen 3.6 Plus review on agentic coding runs into exactly this dynamic — benchmark scores and actual task completion rates tell different stories depending on the scaffold and evaluation criteria used.

For builders thinking about whether to use models for extended agentic tasks, Time Horizons is a more relevant signal than GPQA. But you should hold the specific numbers loosely.

Verdict: Which Benchmark to Trust for Which Decision


Use GPQA scores when you’re selecting a model for tasks that look like knowledge retrieval, reasoning about well-defined problems, or question answering in specialized domains. If you’re building something that needs to answer hard technical questions correctly, GPQA performance is a reasonable proxy. It’s also useful for tracking relative progress between models on a consistent scale — even if the absolute meaning of the score is uncertain, the ranking tends to be informative.

Use Time Horizons when you’re trying to understand whether a model can complete extended agentic tasks autonomously. If you’re evaluating whether to deploy an AI agent for multi-step workflows, the time horizon number gives you a rough sense of where the model’s autonomous capability falls off. Just remember the error bars: the number could be half or double what’s published, and the human baselines for a third of the tasks are estimated, not measured.

Don’t take either benchmark’s number literally. Both Rein and Barnes are explicit about this. The biggest source of uncertainty isn’t statistical — it’s the gap between the benchmark distribution and the real-world tasks you actually care about. A model with a 4-hour time horizon doesn’t mean it can do anything you’d do in 4 hours. It means it can complete roughly half of a specific distribution of tasks that take a person with relevant expertise about 4 hours, in a terminal environment, without the tacit knowledge that comes from being embedded in your specific organization and codebase.

The negative correlation finding from METR’s human baseline hiring is a useful reminder of how hard capability measurement is in general: years of experience was negatively correlated with benchmark performance, while in-network contacts outperformed more credentialed hires. Abstract proxies for capability — whether that’s a PhD, years of experience, or a benchmark score — have real predictive value but also real limits.

If you’re building agents that need to operate across a range of task types and durations, MindStudio handles the orchestration layer — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which means you can run the same task against multiple models and see empirically which one actually completes it, rather than relying entirely on published benchmark numbers.

The deeper point Rein makes is worth sitting with: the standard error from your data is almost always a tiny fraction of the actual uncertainty. The real question is how the benchmark generalizes to the real world. Both GPQA and Time Horizons are honest attempts to answer that question, and both are honest about their limits. The mistake is treating either number as more precise than it is.

For builders making concrete decisions about which models to use for which tasks, the most useful thing you can do is run your actual tasks against your actual models, with your actual evaluation criteria. Benchmarks are a starting point for narrowing the field — the published numbers tell you where to start looking, not where to stop.
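
A “run your own tasks” harness doesn’t need to be elaborate. The sketch below is deliberately generic and assumes a hypothetical `complete(model, prompt)` client rather than any specific provider or MindStudio API: the same task list and the same pass/fail checks, run against each candidate model.

```python
# Minimal model-comparison loop over your own tasks. `complete` is a
# hypothetical stand-in for whatever client you actually call.
def compare_models(models, tasks, complete):
    """tasks: list of (prompt, check_fn) pairs; check_fn(output) -> bool."""
    results = {}
    for model in models:
        passed = sum(bool(check(complete(model, prompt))) for prompt, check in tasks)
        results[model] = passed / len(tasks)
    return results

# Example usage (replace the trivial check with your real evaluation criteria):
# scores = compare_models(
#     models=["model-a", "model-b"],
#     tasks=[("Summarize this ticket: ...", lambda out: len(out) > 0)],
#     complete=my_client,   # hypothetical callable
# )
```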


One practical implication of the compiler analogy Rein raises: just as compilers produce assembly that no human would hand-write, but enabled software engineering to scale dramatically, AI-generated code that’s messy by human standards might still be useful if the output works and can be built on. Tools like Remy take this abstraction one level further — you write a spec in annotated markdown, and the full-stack application gets compiled from it, with the spec as the source of truth and the generated code as derived output. Whether that’s the right model depends on whether you care more about the code or the running application.

The ARC v2 example — LLM performance crashed to approximately 0% on release, then saturated again eight months later — is the benchmark lifecycle in miniature. Any fixed evaluation eventually becomes a target. The value of Time Horizons is that it’s trying to define the distribution from first principles rather than by adversarial selection, which might give it a longer useful life. Whether that works is something we’ll know in a few years, when we can see whether the trend line held.

Both benchmarks are worth tracking. Neither is sufficient. The researchers who built them will be the first to tell you that.
