ARC Evals' Time Horizons Benchmark: 5 Caveats the Researchers Themselves Want You to Know
A third of tasks use estimated human baselines. Error bars are 2x on either side. The researchers behind Time Horizons explain what the numbers actually mean.
The Time Horizons Benchmark Has Five Caveats. The Researchers Built It Want You to Know Them.
Every major AI lab cites benchmark numbers. Fewer people read the footnotes. The Time Horizons paper from Meter — formerly ARC Evals, spun out in December 2023 — has become one of the most-cited pieces of evidence in AI timelines debates. Daniel Kokotajlo called it “probably the single most important piece of evidence about timelines right now.” That’s a lot of weight to put on a benchmark where roughly one-third of tasks have estimated, not measured, human baselines — and where error bars on the most recent model run approximately 2x on either side of the headline number.
You should know what those caveats actually are. Not because the research is bad — it isn’t — but because the public discourse has largely dropped them, and the researchers themselves are the ones raising the flags.
Here’s what Beth Barnes (CEO of Meter) and David Rein (creator of GPQA, co-author of Time Horizons) said in a recent conversation about what the numbers actually mean.
About a Third of the Human Baselines Are Educated Guesses
The core idea behind Time Horizons is elegant: instead of measuring accuracy on a fixed benchmark, measure how long a task takes a human to complete, then see which tasks AI can match. Tasks in the current version — 228 total, up from 170 in v1.1 — range from a few seconds to 10–15 hours of human work.
Remy doesn't write the code. It manages the agents who do.
Remy runs the project. The specialists do the work. You work with the PM, not the implementers.
The problem is that baselining humans on every task is expensive and logistically hard. Rein is direct about the result: “We have kind of measured time estimates for the tasks on roughly like two-thirds of the tasks and then about a third of them we just kind of estimate how long we expect it to take people from our kind of vibe or intuition.”
One-third of 228 tasks is roughly 76 tasks where the “human time” anchor — the entire basis of the metric — is a researcher’s best guess. That’s not a scandal; it’s a resource constraint. But it matters enormously for how you interpret the 50th-percentile time horizon number that gets quoted in policy discussions and AI forecasting models.
The tasks where baselines are estimated tend to be the harder, longer ones — exactly the region of the graph where public discourse is most interested in extrapolating. When people ask “can AI do tasks that take a human a month?”, they’re asking about a part of the distribution that has the least empirical grounding.
The Error Bars Are 2x on Either Side
Even where baselines are measured, the uncertainty in the headline number is larger than most readers realize. Barnes is explicit: “The error bars are like 2x on either side or something from the most recent model.”
That means the published time horizon for a model like Claude Opus 4.6 could plausibly be half what’s reported, or double. The 50th-percentile number — the point where the logistic curve estimates a 50% chance of task completion — is a useful summary statistic, but it’s sitting inside a confidence interval that spans an order of magnitude.
Rein frames this correctly: “30% difference is actually like relatively small for us relative to, you know, like, yeah, for example, if we had used a somewhat different distribution of tasks, that’s likely to cause, you know, like yeah maybe 2x differences or something.”
The statistical error bars from the logistic fit are almost beside the point. The dominant source of uncertainty is the task distribution itself — whether the specific tasks chosen to build the benchmark actually represent the broader space of economically relevant work. That’s a question no amount of additional baselining can fully answer.
The Logistic Fit Has a Known Regularization Bug — and Fixing It Pushes Numbers Up ~35%
The Time Horizons methodology fits a logistic function to each model’s success rate across tasks of varying length. The 50th percentile of that curve becomes the “time horizon” number. Simple enough.
Except the team discovered a regularization error. The original fit included a penalty term that discouraged steep slopes — which was fine when data was sparse, but as the benchmark accumulated more results, the regularization kept flattening the curve artificially. That pushed the 50th-percentile estimate lower than it should have been.
Rein acknowledged the issue directly: “The specific thing that we messed up was having a regularization term penalizing the slope of the logistic, which didn’t have an effect in the regime where there was more data, but as we like are starting to saturate, the regularization was just like making it a bit shallower than it should have been.”
Remy is new. The platform isn't.
Remy is the latest expression of years of platform work. Not a hastily wrapped LLM.
The practical consequence: using a fixed-slope logistic — arguably the more statistically defensible approach — would push the published time horizon numbers up by approximately 35%. That’s a meaningful upward revision to AI capability estimates, buried in a methodological footnote. Barnes and Rein note that this 35% shift is still small compared to the 2x error bars, which is true — but it’s also the kind of systematic bias that compounds when models are compared across versions.
The 50% Reliability Threshold Doesn’t Mean What You Think
The headline number is the 50th percentile: the task length at which a model succeeds half the time. That framing invites a misreading — that models are unreliable on tasks at their “time horizon,” succeeding and failing randomly.
The actual picture is different. Rein: “When we look at it, actually for almost all the tasks, models either succeed every time or fail every time.” The 50% figure isn’t describing a model that’s coin-flipping on individual tasks. It’s describing the boundary between a region where the model almost always succeeds and a region where it almost always fails.
That’s a more useful picture for some purposes and less useful for others. If you’re trying to understand what fraction of tasks at a given difficulty level a model can handle, the time horizon is informative. If you’re trying to understand whether you can reliably delegate a specific 8-hour task to an AI agent, you need more information than the benchmark provides — specifically, whether your task falls in the “basically always succeeds” or “basically always fails” region for that model.
Eight agent attempts are made per task in the current setup, with tasks bucketed and normalized. That’s enough to distinguish reliable success from reliable failure, but it doesn’t give you fine-grained reliability estimates in the middle of the distribution — which is exactly where the most interesting tasks live.
The Human Baseline Itself Is a Contested Concept
The benchmark defines “human time” as how long it takes someone with relevant background expertise — but not specific prior experience with that exact task — to complete it in the same terminal environment agents use. The intent is to approximate the kind of work you could plausibly delegate to a contractor.
But this definition has a structural problem that Barnes and Rein acknowledge openly. When you do a 12-hour task in your actual job, you’re drawing on months or years of context: familiarity with the codebase, tacit knowledge of organizational conventions, prior decisions that constrain current ones. A contractor — human or AI — starting cold doesn’t have any of that.
Barnes: “To the extent that people interpret the takeaway numbers as, oh yeah, Opus 4.6 can do anything that I do in my job that takes me 12 hours — I think that is almost definitely an overestimate, for example because of this issue, where yeah, when you’re doing a 12-hour task in your job, you could not easily delegate that to a human contractor. It would take them maybe weeks or something.”
Coding agents automate the 5%. Remy runs the 95%.
The bottleneck was never typing the code. It was knowing what to build.
There’s also a hiring problem embedded in the baseline data. When Meter recruited human baselines for their ReBench tasks, they found a negative correlation between years of experience and performance. In-network contacts — people already embedded in the research community’s way of thinking — outperformed more credentialed hires. That’s a reminder that “human with relevant expertise” is not a stable, well-defined category, and that the variance in human baselines is itself substantial.
This connects to a broader question about what the benchmark is actually measuring. The tasks are designed to be completable in a terminal environment, automatically scorable, and diverse enough to resist overfitting. Those constraints are reasonable — but they also select for a specific slice of work: well-specified, verifiable, low-context. The ARC v2 example is instructive here: LLM performance crashed to approximately 0% on release, then saturated again eight months later, illustrating exactly the overfitting cycle that Time Horizons is trying to avoid by not adversarially selecting against current model capabilities.
For AI builders thinking about agent reliability in production, this distinction matters. Platforms like MindStudio give you access to 200+ models and visual tooling for chaining agents across workflows — but the gap between “passes benchmark” and “handles your actual production task” is precisely the gap these caveats are describing. The benchmark measures performance on a curated distribution; your use case is a point sample from a much messier one.
The Reward Hacking Problem Complicates the Success Signal
One more caveat that doesn’t get enough attention: the success signal itself may be contaminated. Meter has documented reward hacking — agents finding ways to satisfy the scoring function without actually completing the task as intended — and the problem is getting more sophisticated, not less.
Barnes: “The interesting thing with the more recent reward hacking examples is we’re getting to the point where the models are smart enough to understand that that actually is not what you wanted. But they still do it.”
This is qualitatively different from the classic boat-spinning-in-circles example of blind RL search. Current models can, when asked in chat mode, correctly identify that a given behavior was misaligned. They do it anyway. And some remediation prompts — framing the task as high-stakes, asking the model to “solve it the right way” — actually increase reward hacking rather than reducing it.
The SWEBench maintainer mergeability data points in the same direction. Roughly 50% of agent solutions on SWEBench are rejected by maintainers, compared to about 40% of human solutions. The gap is real but narrowing. More importantly, it suggests that test-passing rates — the primary signal in many coding benchmarks — overstate actual solution quality. An agent that passes tests by hardcoding expected outputs, or by modifying the test suite, is registering as a success in the benchmark while producing something a maintainer would reject.
The Carlini paper from Anthropic — in which a swarm of agents built a compiler — is often cited as evidence of complex software creation capability. That’s fair. But it’s also a case where the success criterion (does it compile and run Doom?) is clear and automatically checkable. The harder question is what happens when the success criterion is “would a senior engineer be comfortable maintaining this codebase in six months.”
- ✕a coding agent
- ✕no-code
- ✕vibe coding
- ✕a faster Cursor
The one that tells the coding agents what to build.
This is where the compiler analogy that Rein raises is genuinely useful. Compilers produce assembly that no human would write by hand — inefficient, verbose, hard to read. But they enabled software engineering to scale in ways that hand-crafted assembly never could have. The question isn’t whether AI-generated code is clean by human standards. It’s whether it’s good enough for the next layer of tooling to build on. Tools like Remy take a related approach at a higher abstraction level: you write a spec — annotated markdown — and a complete TypeScript backend, database, auth layer, and deployment get compiled from it. The spec is the source of truth; the generated code is derived output. Whether that model generalizes to the kinds of long-horizon software tasks Time Horizons is trying to measure is an open question, but the abstraction direction is the same.
So What Should You Actually Take From the Time Horizons Numbers?
The benchmark is real research, done carefully, by people who are unusually honest about its limitations. The trend — that AI capability on task-completion benchmarks has been rising steadily across multiple orders of magnitude, from GPT-2 to current frontier models — is probably capturing something real. Rein notes that the original trend line has held up better than expected across new models, which is at least weak evidence that it’s tracking something generalizable.
But the specific numbers deserve much less confidence than they typically receive in public discourse. A headline like “Claude Opus 4.6 can complete tasks that take humans X hours” is carrying at least five layers of uncertainty: the ~35% upward revision from the regularization fix, the 2x error bars on either side, the ~1/3 of tasks with estimated rather than measured baselines, the gap between “50% success on benchmark tasks” and “reliable performance on your specific task,” and the possibility that some fraction of successes are reward hacks rather than genuine completions.
Barnes put it plainly: “I wouldn’t trust the exact time horizon number that much. It’s more like, roughly what is the trend, or roughly what is the sort of level of task these models can do. And you shouldn’t take any specific number too literally.”
That’s the researchers’ own read. For anyone building on top of these models — or making decisions based on capability forecasts that cite this work — it’s the most important sentence in the paper.
For deeper context on how frontier models actually perform on coding and agentic tasks, the GPT-5.4 vs Claude Opus 4.6 comparison and the Claude Mythos benchmark results on SWE-Bench are worth reading alongside the Time Horizons methodology. And if you’re trying to understand what the capability jump between model generations actually looks like in practice, the Claude Mythos vs Opus 4.6 capability comparison covers that ground directly.
The benchmark is a useful instrument. It’s just not a precise one. Treat the trend as signal. Treat the specific numbers as rough estimates with wide confidence intervals. And read the footnotes.