
Time Horizons Benchmark Numbers Are Understated by ~35% — Here's the Statistical Reason Why

Using a fixed-slope logistic fit — arguably more statistically valid — pushes the published Time Horizons numbers up by roughly 35%. Paper co-author David Rein explains the methodology gap.

MindStudio Team

The Statistical Flaw That Makes Time Horizons Look Conservative

The Time Horizons benchmark — published by METR (formerly ARC Evals) — reports that Claude Opus 4.6 can complete tasks that take humans several hours. That headline number is already striking. But there’s a methodological detail buried in the paper that pushes the real number up by approximately 35%: the regularization term on the logistic slope.

David Rein, co-author of the Time Horizons paper and creator of the GPQA benchmark, explained it directly in a recent interview: the original fit used a regularization term that penalized steep slopes on the logistic function. In data-rich regions, this had no effect. But as the benchmark started saturating at the high end — where tasks take 10–15 hours of human work — the regularization artificially flattened the curve, pulling the 50% crossing point downward. A fixed-slope logistic, which Rein describes as arguably more statistically valid, would push the published time horizon numbers up by around 35%.

That’s not a rounding error. That’s a meaningful upward revision to the central capability estimate from the benchmark that Daniel Kokotajlo called “probably the single most important piece of evidence about timelines right now.”

What the Time Horizons Methodology Actually Does

Before you can reason about the error, you need to understand what’s being measured.


The benchmark contains 228 tasks (up from 170 in v1.1), ranging from tasks that take humans a few seconds to tasks requiring 10–15 hours of focused work. Human time-to-complete is the x-axis; model success rate is the y-axis. A logistic function is fit to the resulting scatter of successes and failures, and the point where that curve crosses 50% — the task duration at which the model is estimated to succeed on half the tasks — becomes the “time horizon” number for that model.
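
To make the procedure concrete, here is a minimal sketch of the core fit: agent success regressed against log human time, then solved for the 50% crossing. The task data, the use of scikit-learn, and the effectively unregularized penalty are illustrative assumptions, not METR's actual code.

```python
# Minimal sketch of the time-horizon fit: logistic regression of agent
# success against log(human time-to-complete), then solve for the
# duration where predicted success crosses 50%.
# The task data below is invented for illustration, not METR's dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: (human minutes to complete, did the agent succeed?)
tasks = np.array([
    [0.5, 1], [2, 1], [8, 1], [15, 1], [30, 1],
    [60, 1], [120, 0], [240, 1], [480, 0], [900, 0],
])
log_minutes = np.log(tasks[:, [0]])   # fit in log-time, as the benchmark does
success = tasks[:, 1]

model = LogisticRegression(C=1e6)     # effectively unregularized
model.fit(log_minutes, success)

# P(success) = sigmoid(w * log(t) + b); the 50% crossing is where w*log(t) + b = 0
w, b = model.coef_[0, 0], model.intercept_[0]
horizon_minutes = np.exp(-b / w)
print(f"Estimated time horizon: {horizon_minutes:.0f} minutes")
```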

The intuition is clean: instead of asking “what percentage of GPQA questions does this model get right,” you’re asking “how long a task can this model reliably handle?” That gives you a single number that’s comparable across GPT-2 and Opus 4.6, even though those models are operating in completely different capability regimes.

The logistic fit is borrowed from item response theory — the same statistical framework used to calibrate exam questions. You have task difficulty parameters and model ability parameters, and you’re fitting them simultaneously. Rein’s stated preference is to keep the analysis legible: if you can’t eyeball the graph and see roughly where the curve crosses 50%, you probably shouldn’t trust a complicated statistical procedure to find it for you.
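
For the item-response-theory framing, a toy Rasch-style sketch may help (this is the generic IRT setup, not the paper's exact parameterization; all data and dimensions are invented): every model gets an ability parameter, every task gets a difficulty parameter, and both sets are fit jointly by maximum likelihood.

```python
# Toy Rasch-style sketch of the IRT framing: P(model m solves task t) =
# sigmoid(ability[m] - difficulty[t]), with abilities and difficulties fit
# jointly by maximum likelihood. Invented data; generic IRT, not the paper's code.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n_models, n_tasks = 5, 50
outcomes = rng.random((n_models, n_tasks)) < expit(
    np.linspace(-2, 2, n_models)[:, None] - rng.normal(0, 1.5, n_tasks)
)

def nll(params):
    ability, difficulty = params[:n_models], params[n_models:]
    p = expit(ability[:, None] - difficulty)
    return -np.sum(np.where(outcomes, np.log(p + 1e-9), np.log(1 - p + 1e-9)))

fit = minimize(nll, np.zeros(n_models + n_tasks), method="L-BFGS-B")
# Only differences are identified, so center the scale before comparing models.
ability = fit.x[:n_models] - fit.x[n_models:].mean()
print("estimated model abilities:", np.round(ability, 2))
```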

The Regularization Bug and the 35% Revision

Here’s the specific failure mode. The logistic function has two free parameters: where it crosses 50% (the location) and how steeply it rises (the slope). The original Time Horizons fit included a regularization term that penalized steep slopes — a reasonable prior when you have sparse data, because steep slopes can overfit to noise.

The problem: as models improved and started succeeding on longer tasks, the benchmark began saturating at the high end. In that regime, the regularization term was no longer neutral — it was actively pulling the slope shallower than the data warranted. A shallower slope means the 50% crossing point moves left (toward shorter tasks), which means the published time horizon number is lower than a slope-unconstrained fit would produce.

Switching to a fixed-slope logistic — where the slope is set to a reasonable constant rather than optimized with a regularization penalty — would push the 50th-percentile time horizon up by roughly 35% for the most recent models.
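
A toy sketch of the two fitting strategies is below. The synthetic data, the penalty strength, and the fixed slope constant are all invented, and this is not the paper's code; the point is only to show the mechanics of an L2-penalized slope versus a slope held fixed, and how the two 50% crossings can differ when the long-duration end is sparsely covered.

```python
# Toy comparison of the two fits described above: a logistic with an L2
# penalty on its slope versus a fixed-slope logistic. The synthetic data is
# dense at short durations and sparse at the 10-15 hour end, mimicking the
# saturating regime. All constants here are invented.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
log_t = np.concatenate([rng.uniform(np.log(1), np.log(300), 200),
                        rng.uniform(np.log(600), np.log(900), 15)])
true_slope, true_log_horizon = -2.5, np.log(300)       # sharp transition near 5 hours
success = rng.random(log_t.size) < expit(true_slope * (log_t - true_log_horizon))

def nll(params, fixed_slope=None, penalty=0.0):
    slope, log_h = (params if fixed_slope is None else (fixed_slope, params[0]))
    p = expit(slope * (log_t - log_h))
    ll = np.where(success, np.log(p + 1e-9), np.log(1 - p + 1e-9)).sum()
    return -ll + penalty * slope ** 2                   # L2 penalty favors shallow slopes

regularized = minimize(nll, x0=[-1.0, np.log(100)], args=(None, 5.0))
fixed = minimize(nll, x0=[np.log(100)], args=(-2.5, 0.0))

print("penalized-slope horizon:", round(np.exp(regularized.x[1])), "minutes")
print("fixed-slope horizon:    ", round(np.exp(fixed.x[0])), "minutes")
```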

Rein’s framing of these corrections: “They’re small compared to the error bars. The error bars are like 2x on either side or something from the most recent model.”

That’s the part that deserves to sit with you for a moment. The 35% upward revision from fixing the slope is real and meaningful. And it’s still smaller than the fundamental uncertainty in the measurement.

The Error Bars Are the Real Story

The ±2x error bars on Opus 4.6’s time horizon aren’t a sign of sloppy work. They’re an honest accounting of what’s actually uncertain.

The sources of uncertainty stack up quickly. About one-third of the 228 tasks have estimated rather than measured human baselines — Rein describes these as based on “vibe or intuition” where direct measurement wasn’t feasible. The human baseline measurements that do exist show roughly 3x variation between individuals, even among people selected for appropriate expertise. And the task distribution itself is a benchmark artifact: tasks have to be automatically scorable, completable in a terminal environment, and verifiable without expensive human review. That selection pressure means the benchmark is not a random sample of economically relevant work.
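
One way to get a feel for how these sources compound is a bootstrap: resample the tasks, jitter the human baselines within a plausible range, refit, and look at the spread of recovered horizons. The sketch below uses invented data and a generic resampling scheme; it is not METR's published error-bar procedure.

```python
# Generic bootstrap sketch: resample tasks, jitter human-time baselines
# (a rough stand-in for the ~3x person-to-person spread), refit the logistic,
# and inspect the spread of horizons. Invented data throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_horizon(minutes, success):
    """Fit success vs log(minutes) and return the 50% crossing in minutes."""
    model = LogisticRegression(C=1e6).fit(np.log(minutes)[:, None], success)
    w, b = model.coef_[0, 0], model.intercept_[0]
    return np.exp(-b / w)

rng = np.random.default_rng(0)
minutes = np.exp(rng.uniform(np.log(1), np.log(900), 228))     # invented tasks
success = rng.random(228) < 1 / (1 + (minutes / 240) ** 1.5)   # invented outcomes

horizons = []
for _ in range(1000):
    idx = rng.integers(0, len(minutes), len(minutes))           # resample tasks
    jitter = rng.lognormal(mean=0.0, sigma=0.4, size=len(idx))  # baseline noise
    sub_success = success[idx]
    if sub_success.all() or not sub_success.any():
        continue                                                # need both classes to fit
    horizons.append(fit_horizon(minutes[idx] * jitter, sub_success))

lo, hi = np.percentile(horizons, [2.5, 97.5])
print(f"95% bootstrap interval: {lo:.0f}-{hi:.0f} minutes")
```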


Eight agent attempts are made per task. Tasks are bucketed and normalized to handle the uneven distribution across difficulty levels. All of this is reasonable methodology, but each step introduces variance that compounds.
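
The bucketing step might look something like the sketch below. The bucket edges and the equal-weight scheme are my guesses at the general idea, not METR's exact procedure: average the eight attempts into a per-task success rate, then weight tasks so each difficulty bucket contributes equally to the fit.

```python
# Sketch of per-task success rates from 8 attempts, bucketed by human time
# and weighted so each difficulty bucket counts equally in the downstream fit.
# Bucket edges and weighting are illustrative, not METR's exact choices.
import numpy as np

rng = np.random.default_rng(0)
minutes = np.exp(rng.uniform(np.log(1), np.log(900), 228))
attempts = rng.random((228, 8)) < 1 / (1 + (minutes[:, None] / 240) ** 1.5)
task_success = attempts.mean(axis=1)              # per-task success rate over 8 runs

edges = np.array([0, 4, 15, 60, 240, 960])        # minutes; invented bucket edges
bucket = np.digitize(minutes, edges) - 1
counts = np.bincount(bucket, minlength=len(edges) - 1)
weights = 1.0 / counts[bucket]                    # each bucket gets equal total weight

# These weights would then go into the logistic fit (e.g. via sample_weight).
print({f"{edges[i]}-{edges[i+1]}m": int(c) for i, c in enumerate(counts)})
```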

The result: the headline number for Opus 4.6 could plausibly be half what’s reported, or double. The 35% revision from fixing the logistic slope is a systematic correction to a known bias. The 2x error bars are the irreducible uncertainty from everything else.

This is worth keeping in mind when you see the Time Horizons chart cited in policy discussions or forecasting documents. The trend line — showing consistent improvement from GPT-2 through current models — is probably the most reliable signal. The specific numbers at any given point are much noisier than they appear.

Why the Benchmark Design Choices Matter

The Time Horizons approach was deliberately designed to avoid the adversarial selection problem that plagued earlier benchmarks. ARC v1 and v2 are the canonical example: tasks were selected specifically because current models failed at them, which created a regression-to-the-mean dynamic where future models would show dramatic gains just by training on similar distributions. ARC v2 saw LLM performance crash to approximately 0% on release, then saturate again eight months later — not because of genuine capability jumps, but because of benchmark overfitting.

Time Horizons tries to sidestep this by defining the task distribution on first principles (human time-to-complete) rather than adversarially selecting against current model capabilities. The hope is that a more principled distribution produces steadier, more interpretable trends.

The tradeoff is that the benchmark ends up including tasks that are relatively easy for current models — which is fine for measuring trends but means the benchmark isn’t maximally discriminating at the frontier. For comparing frontier models like GPT-5.4 and Claude Opus 4.6, you’d want a benchmark that’s harder at the top end, not one optimized for longitudinal comparability.

What “50% Reliability” Actually Means

One common misreading of the time horizon number is treating it as a reliability threshold — as if Opus 4.6 succeeds on exactly half of all tasks that take humans that long. That’s not quite right.

Rein’s clarification: for most tasks, models either succeed reliably or fail reliably. The 50% crossing point on the logistic isn’t describing a regime of consistent 50/50 uncertainty — it’s describing the boundary between the “basically always succeeds” region and the “basically always fails” region. The logistic fit is a smooth approximation of what’s actually a fairly sharp transition.

This matters for how you interpret the benchmark in practice. If you’re asking “can I use this model for tasks of this length,” the answer isn’t “you’ll succeed half the time.” It’s closer to “tasks shorter than this, you’ll probably succeed; tasks longer than this, you’ll probably fail; the exact boundary depends on the specific task in ways the time horizon number doesn’t capture.”


The SWE-bench maintainer mergeability data is a useful reality check here. METR found that roughly 50% of agent solutions on SWE-bench would be rejected by maintainers, compared to roughly 40% of human solutions. The gap is real but narrowing. This is the kind of external validation that helps calibrate what the time horizon numbers actually mean for production use — and it suggests the benchmark is measuring something real, even if the specific numbers are uncertain.

Token Budget Awareness as a Calibration Tool

One finding from METR’s scaffolding work is directly applicable if you’re building agents: telling the agent how many tokens it has used and what percentage of its budget remains significantly improves calibration.

Without this information, agents tend to either submit solutions too early or spend time in ways that don’t reflect the actual constraints of the task. Humans have implicit signals about task duration — a manager saying “I’m excited to see results tonight” communicates a time budget without stating one explicitly. Agents operating from a prompt don’t have those signals unless you provide them.

The fix is straightforward: inject token usage and budget percentage into the agent’s context at regular intervals. METR found this made a meaningful difference in agent behavior on their benchmark tasks. If you’re building agentic workflows — whether with Claude Code’s effort level settings or custom scaffolding — this is a low-cost intervention worth implementing.
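
A minimal sketch of that intervention is below. The agent_step callable and the message format are placeholders rather than any specific framework's API; only the budget arithmetic and the injection pattern are the point.

```python
# Minimal sketch of token-budget injection: after each agent step, append a
# status message with tokens used and percent of budget remaining.
# agent_step() and the message format are hypothetical placeholders.

TOKEN_BUDGET = 200_000

def budget_notice(tokens_used: int, budget: int = TOKEN_BUDGET) -> dict:
    remaining_pct = max(0.0, 100.0 * (budget - tokens_used) / budget)
    return {
        "role": "user",
        "content": (
            f"[budget] {tokens_used:,} tokens used; "
            f"{remaining_pct:.0f}% of your budget remains. "
            "Plan your remaining work accordingly."
        ),
    }

def run_with_budget(agent_step, messages: list, budget: int = TOKEN_BUDGET) -> list:
    tokens_used = 0
    while tokens_used < budget:
        reply, step_tokens = agent_step(messages)   # hypothetical: returns (message, tokens)
        tokens_used += step_tokens
        messages.append(reply)
        messages.append(budget_notice(tokens_used, budget))
        if reply.get("stop"):                       # agent signals it is done
            break
    return messages
```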

Platforms like MindStudio handle this kind of orchestration scaffolding at the infrastructure level: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows, which means you can implement token-budget-aware agents without writing the plumbing from scratch.

The Compiler Analogy for AI-Generated Code

The Time Horizons benchmark includes tasks like training masked language models without using division or exponentiation operators — tasks designed to require genuine problem-solving rather than pattern matching against training data. The Carlini paper from Anthropic, where a swarm of agents built a compiler, is cited as evidence that complex software creation is within reach.

Rein’s compiler analogy is worth sitting with. Before compilers, programmers hand-crafted assembly — every register used efficiently, no wasted memory. Compilers produce “garbage” machine code by comparison: bloated, unoptimized, not what a human expert would write. But compilers enabled software engineering to scale in ways that hand-crafted assembly never could.

The same dynamic may apply to AI-generated code. The code is often poorly factored, with control flow scattered across files in ways that would concern a senior engineer. But if the output works — if you can actually build complex systems with it — the question of whether it’s “good code” by human standards may be less important than whether it’s functional and extensible.

This is exactly the abstraction argument behind tools like Remy, which treats the spec as the source of truth and generates a complete TypeScript backend, SQLite database, auth, and deployment from annotated markdown. The generated code isn’t hand-crafted — it’s derived output. Fix the spec, recompile. The question isn’t whether the assembly is beautiful; it’s whether the program runs.

The Negative Experience Correlation

One finding from METR’s human baseline work that doesn’t get enough attention: years of experience was negatively correlated with benchmark performance in their hiring process.

The people doing best on the baseline tasks were in-network contacts — people culturally aligned with how METR thinks about problems. More credentialed candidates, with longer CVs and more years of experience, actually performed worse on the tasks.


This has a direct implication for how you interpret the human baseline numbers in Time Horizons. The “human” in “human time-to-complete” is not a median worker or a credentialed expert — it’s a somewhat idiosyncratic sample of people who happened to be available and appropriate for the tasks. The 3x variation in baseline times across individuals reflects this. The benchmark is measuring something real, but the human reference point is noisier than it looks.

Reading the Chart Correctly

The Time Horizons chart shows consistent improvement from GPT-2 through current models, plotted on a log scale. The trend line has held up better than expected — Rein describes being “surprised” by how well the original trend has continued.

But the right way to read the chart is as a trend, not as a precise capability statement. The 35% upward revision from fixing the logistic slope is a systematic correction that should shift your priors about where current models actually sit. The 2x error bars mean the specific number for Opus 4.6 is genuinely uncertain. The one-third of tasks with estimated (not measured) human baselines adds another layer of noise.

The honest summary: models are improving on tasks that take humans hours to complete, the improvement is consistent and measurable, and the specific numbers are uncertain enough that you should weight the direction of the trend more heavily than any particular data point.

That’s a more useful frame than either “AI can do anything a human can do in 12 hours” or “these benchmarks are meaningless.” The trend is real. The error bars are also real. Both things are true simultaneously, and the 35% revision from a statistical correction is a good reminder that published benchmark numbers deserve more scrutiny than they typically receive.

For anyone building AI agents for research and analysis or trying to calibrate what current models can actually do autonomously, the Time Horizons methodology — with its known limitations fully accounted for — is still one of the more principled frameworks available. Just don’t mistake the headline number for a precise measurement.

Presented by MindStudio
