
What Is the Jagged Frontier? Why AI Models Improve Unevenly

The jagged frontier explains why AI models excel at hard tasks while failing simple ones. Understanding it helps you pick the right model for each job.

MindStudio Team

The Uneven Edge of AI Capability

AI models don’t improve uniformly. One model can draft a legal brief at near-expert level, then fail to count the number of times a letter appears in a word. Another can solve a graduate-level math proof but trip over basic spatial reasoning that a child handles easily. This is the jagged frontier: the uneven, unpredictable boundary of what AI can and cannot do.

Understanding the jagged frontier isn’t just an academic exercise. It directly affects how you pick models, design workflows, and set expectations for what AI will actually deliver — and where it will quietly let you down.

This article explains what the jagged frontier is, why it exists, and how to use that knowledge to make better decisions when building with or deploying AI.


Where the Concept Comes From

The term “jagged frontier” gained traction from research published by Harvard Business School in 2023. The study placed consultants from a major firm in controlled conditions — some using AI assistance, some not — and measured performance across a range of tasks.

The results were striking. On tasks inside AI’s capability boundary, consultants using AI significantly outperformed those working alone. But on tasks just outside that boundary, the AI-assisted group underperformed. The AI confidently produced plausible-looking but wrong output, and consultants trusted it.

The metaphor is the frontier — think of a line on a map marking what territory AI controls. But instead of a clean border, it’s jagged: jutting far forward in some areas (sophisticated analysis, structured writing, code generation) and falling far back in others (spatial tasks, novel reasoning, tasks requiring genuine world interaction).

This jagged shape is the key insight. You can’t assume that because an AI handles one hard task well, it will handle adjacent tasks at all.


Why the Frontier Is Jagged, Not Smooth

If AI models got uniformly better with each generation, the frontier would gradually expand outward in all directions. Progress would be predictable. You’d know that a newer, larger model is better at everything than an older, smaller one.

That’s not what happens. Several factors create the jagged shape.

Training Data Distribution

Large language models learn from text. They’re exceptionally good at tasks that are well-represented in human writing: summarizing, drafting emails, explaining concepts, writing code in popular languages. These are tasks humans write about constantly.

But some tasks are rarely described in text. How to mentally rotate a 3D object. How to track a moving target across frames. How to reason about physical causality in real-time. These capabilities don’t have a rich text substrate to learn from, so models develop them poorly or not at all — regardless of how large the model gets.

Benchmark Saturation vs. Real Capability

AI labs train against benchmarks. When a benchmark becomes well-known, models gradually get optimized for it — whether through direct training on benchmark-adjacent data or through more systematic benchmark gaming. Scores on those benchmarks stop being good proxies for actual capability.

Meanwhile, capability on tasks that aren’t benchmarked can lag far behind. The frontier advances where there’s measurement pressure, and stagnates where there isn’t.

This is why benchmarks designed to be hard to game reveal such a different picture. ARC-AGI 3, for instance, presents novel visual puzzles that require flexible reasoning — not pattern matching on training data. Frontier models that score impressively on standard benchmarks have scored 0% on ARC-AGI 3. The frontier juts backward sharply in that direction.

Emergent Capabilities at Scale

Some capabilities appear suddenly as models get larger, rather than improving gradually. A model at one parameter count might be completely unable to do multi-step arithmetic, then a larger version develops that ability with no incremental steps in between.

This means the frontier doesn’t just expand outward — it occasionally jumps forward in specific areas. A capability that was outside the frontier yesterday might be well inside it today after a threshold is crossed in training. That makes the jagged shape even harder to predict.

Task Difficulty Is Counterintuitive

Human intuition about what’s “hard” doesn’t map onto AI difficulty at all. Tasks that took humans centuries to master — like playing chess or Go at grandmaster level — turned out to be relatively tractable for AI. Tasks that humans do effortlessly — recognizing a friend across a parking lot, following a conversation in a noisy room — proved far harder.

This means the jagged frontier cuts across difficulty in ways that surprise people. An AI might ace a bar exam but fail to correctly identify which of two images has more objects in it. The frontier isn’t correlated with what humans find hard.


What the Frontier Looks Like in Practice

It helps to look at concrete examples of where the jagged frontier shows up in real model behavior.

Areas Where Models Are Surprisingly Strong

  • Long-form writing and editing — Models can draft, revise, and adapt structured prose at a level that surpasses most human first drafts.
  • Code generation in common languages — Production-quality code in Python, TypeScript, and other well-represented languages is often correct on the first attempt for well-specified tasks. Rapidly climbing SWE-Bench scores among recent frontier models reflect real capability gains in this area.
  • Summarization and synthesis — Distilling long documents into structured summaries is something models handle reliably.
  • Pattern recognition in structured data — Given well-formatted input, models can extract, classify, and organize information accurately.

Areas Where Models Are Surprisingly Weak

  • Counting and tracking — For years, models frequently miscounted the letters in words, a failure that looked embarrassing next to their other capabilities.
  • Novel multi-step logical reasoning — Tests like the Pencil Puzzle Benchmark show models struggling with logical deduction chains that don’t map to memorized patterns.
  • Spatial and physical reasoning — Anything requiring a mental model of how objects move through space tends to expose sharp capability gaps.
  • Open-ended research-level problems — The FrontierMath benchmark, which uses unpublished research-level mathematics problems, sees very low model performance despite strong scores on standard math benchmarks.
  • Real autonomous task completion — The Remote Labor Index found AI agents completing only 2.5% of real freelance work tasks autonomously, even as benchmark scores climbed.

Why Benchmarks Make the Frontier Hard to Read

Benchmarks are supposed to measure where the frontier is. In practice, they often obscure it.

When a benchmark becomes established and widely used, it attracts optimization pressure. Labs train on data distributions similar to the benchmark. Models get better at the benchmark faster than they get better at the underlying capability the benchmark is supposed to measure.

The Humanity’s Last Exam benchmark revealed a 21-point score inflation when independent testing removed contaminated questions. That’s not a minor rounding error — it’s the difference between thinking a model is highly capable in a domain and it actually being mediocre.

SWE-Rebench showed similar dynamics when decontaminated tests exposed significant inflation in reported coding scores. The frontier appeared further out than it actually was.

This creates a practical problem: when you deploy a model based on its benchmark performance, you might be deploying into territory the model doesn’t actually control. The confident scores were artifacts of optimization, not evidence of genuine capability.

The honest way to read the frontier is through benchmarks explicitly designed to resist gaming — novel problems, interactive tasks, decontaminated test sets, tasks that require genuine reasoning rather than pattern retrieval.


How Models Improve (and Why Improvement Is Uneven)

New model versions don’t push the frontier outward uniformly. Improvement concentrates in areas where:

  1. There’s training pressure. Tasks that are benchmarked, requested often, or well-represented in RLHF feedback get better faster.
  2. The capability is tractable at scale. Some things genuinely improve as compute and data increase. Others hit a ceiling from structural limitations.
  3. Researchers focus effort. Coding capability improved dramatically partly because labs prioritized it and built specific training pipelines around it.

This means the frontier can advance significantly in specific areas between model versions while staying flat or even regressing in others. A newer model isn’t automatically better at everything.

Claude Opus 4.5 reached what some called an agentic tipping point — but that was specifically in tool use and multi-step task completion, not a uniform capability jump across the board. The frontier advanced sharply in that direction.

It’s also worth noting that improvement in one area can come at cost in another. Fine-tuning a model toward specific behaviors can degrade general reasoning. Optimizing for helpfulness can reduce accuracy. The jagged shape isn’t static — it shifts as models change.


What This Means for Choosing and Using Models

Understanding the jagged frontier changes how you should think about model selection.

Don’t Treat Models as Uniformly Capable

A model that’s strong at creative writing may not be strong at structured data extraction. A model that excels at code generation may hallucinate in ways that matter on factual questions. Evaluating AI models for speed vs. quality is part of the picture — but evaluating them for the specific tasks you care about is more important.

Run your own tests on your actual use cases. Don’t rely on leaderboards for tasks that aren’t your tasks.
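Running your own tests can be as simple as a small harness that scores each candidate model against your actual tasks. The sketch below assumes you supply your own API client; `call_model`, the model names, and the checker functions are all placeholders, not a real provider interface.

```python
# Minimal evaluation harness: score candidate models on your own tasks.
# `call_model` is a placeholder -- swap in your real API client.

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call (Anthropic, OpenAI, etc.)."""
    raise NotImplementedError

def evaluate(models, tasks, call=call_model):
    """tasks: list of (prompt, checker) pairs, where checker(output) -> bool.

    Returns each model's pass rate on *your* tasks, which is the number
    that matters -- not its score on someone else's leaderboard.
    """
    scores = {}
    for model in models:
        passed = 0
        for prompt, checker in tasks:
            try:
                if checker(call(model, prompt)):
                    passed += 1
            except Exception:
                pass  # a crash or refusal counts as a failure
        scores[model] = passed / len(tasks)
    return scores
```

Even a dozen representative tasks with simple pass/fail checkers will reveal more about where a model’s frontier sits for your use case than any general benchmark.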

Use Multiple Models for Different Jobs

The jagged frontier is a strong argument for multi-model workflows. Route tasks to models that are strong in those specific areas rather than defaulting to one frontier model for everything.

AI model routers formalize this: they direct each task to the model best suited for it, which often means a smaller, faster, cheaper model for routine tasks and a larger model only when genuinely needed. This isn’t just about cost — it’s about capability routing. The jagged frontier means you can often get better results from a cheaper model on certain tasks than from an expensive frontier model.
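At its core, capability routing can be sketched as a classification step followed by a lookup. The categories, keyword rules, and model names below are illustrative placeholders, not a real routing policy; production routers typically use a learned classifier rather than keywords.

```python
# Sketch of capability-based routing: classify the incoming task, then
# dispatch to the model configured for that category. Categories and
# model names are illustrative placeholders.

ROUTES = {
    "summarization": "small-fast-model",    # routine task: a cheap model suffices
    "code_generation": "coding-model",      # strong where its frontier juts forward
    "open_ended_reasoning": "frontier-model",
}

def classify(task: str) -> str:
    """Toy keyword classifier; real routers use a learned classifier."""
    lowered = task.lower()
    if "summarize" in lowered:
        return "summarization"
    if "write code" in lowered or "implement" in lowered:
        return "code_generation"
    return "open_ended_reasoning"

def route(task: str) -> str:
    return ROUTES[classify(task)]
```

The design point is that the routing table, not the caller, encodes your map of the frontier: when a model’s capability profile changes, you update one lookup rather than every workflow.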

Be Skeptical of Confident Wrong Answers

The most dangerous part of the jagged frontier isn’t the areas where models obviously fail. It’s the areas just inside the failure zone where models produce confident, plausible-looking output that happens to be wrong.

The HBS research found this specifically: consultants with AI assistance did worse on tasks where AI was outside its capability boundary because the AI produced confident-sounding wrong answers. AI agent failure modes often trace back to exactly this dynamic — the model knows enough to sound right but not enough to be right.

The practical implication: for high-stakes tasks, don’t just use AI — verify the output, especially in areas where you can’t easily spot errors yourself.
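One way to build that verification in is to refuse any output that fails an independent, deterministic check before it is accepted. The sketch below assumes a caller-supplied `ask_model` function and validator; both are placeholders.

```python
# Sketch of output verification for high-stakes tasks: never accept a
# model answer that fails an independent, deterministic check.
# `ask_model` and `validate` are caller-supplied placeholders.

def verified_answer(ask_model, prompt, validate, retries=2):
    """Return the model's answer only if `validate(answer)` passes.

    validate: a deterministic check (schema validation, arithmetic
    recomputation, a test suite, ...). If the model fails every attempt,
    we raise instead of letting a confident wrong answer propagate.
    """
    for _ in range(retries + 1):
        answer = ask_model(prompt)
        if validate(answer):
            return answer
    raise ValueError("model output failed verification; escalate to a human")
```

The validator does not need to be smart — it only needs to be independent of the model, so that a confident-sounding wrong answer cannot vouch for itself.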

Understand That the Frontier Is Smoothing in Some Areas

There’s a separate but related idea worth knowing: AI capabilities are smoothing out for knowledge work. As models improve, the frontier for well-defined knowledge tasks is becoming more uniform. The gap between what a model can and can’t do in structured professional domains is narrowing.

This doesn’t mean the jagged frontier is disappearing — it means the frontier is advancing in ways that matter for office work, while remaining jagged in areas like physical reasoning, genuine novelty, and open-ended agentic tasks.


The Frontier in Agentic Workflows

The jagged frontier matters especially in agentic settings, where models aren’t just answering questions but taking sequences of actions to complete tasks.

In a single-turn question, a model failure is contained. In an agentic workflow, one failure can cascade. A wrong decision at step three can corrupt everything downstream. This makes the jagged shape of AI capability much higher stakes.

The sub-agent era responds to this partly by decomposing tasks — having specialized, smaller models handle discrete subtasks rather than relying on one model to handle everything. That’s a structural adaptation to the jagged frontier: if no single model is uniformly good at all tasks, design systems that route each subtask to the model with capability in that specific area.
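The decomposition idea can be sketched as a pipeline where each step names the model it needs and a check runs between steps, so one failure halts the run instead of silently corrupting everything downstream. Step names, model names, and `run_step` here are all illustrative placeholders.

```python
# Sketch of an agentic pipeline that routes each subtask to its own
# model and verifies the result before the next step runs, so a bad
# decision at step three can't quietly cascade.
# Step names, models, and `run_step` are illustrative placeholders.

def run_pipeline(steps, run_step):
    """steps: list of (name, model, check) tuples.

    run_step(model, name, context) -> the result of that subtask.
    check(result) -> bool; a failed check stops the cascade early.
    """
    context = {}
    for name, model, check in steps:
        result = run_step(model, name, context)
        if not check(result):
            raise RuntimeError(f"step '{name}' failed verification; halting")
        context[name] = result  # downstream steps see verified results only
    return context
```

Because every step carries its own model and its own check, the pipeline is an explicit map of which parts of the workflow you believe sit inside the frontier — and the checks are where that belief gets tested.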

Understanding which parts of your workflow fall inside the frontier and which don’t is the core design challenge for agentic systems.


How Remy Navigates the Jagged Frontier

Remy’s response to the jagged frontier is practical: don’t bet on one model being good at everything. Because the frontier is jagged, different tasks in a build pipeline benefit from different models.

When Remy compiles a spec into a full-stack app — backend, database, auth, frontend, tests — different jobs get routed to different models. Core agent reasoning uses Claude Opus for depth. Specialist tasks use Sonnet. Image generation and analysis go to models purpose-built for those tasks. The spec-as-source-of-truth architecture means that when a better model becomes available for a particular task, output improves without changing your spec.

This also means Remy doesn’t lock you into one model’s capability profile. As the frontier advances in specific areas, Remy can take advantage — and where a model is currently weak, the task goes somewhere else.

If you’re building applications and want to see this in practice, you can try Remy at mindstudio.ai/remy.


Frequently Asked Questions

What exactly is the jagged frontier in AI?

The jagged frontier is the uneven boundary of AI capability. Rather than being uniformly capable at everything up to a certain difficulty level, AI models excel at some hard tasks and fail at some easy ones. The “frontier” describes the outer edge of what a model can do — and “jagged” describes how irregular that edge is, jutting far forward in some areas and far back in others.

Why do AI models fail at tasks that seem simple?

Because AI difficulty doesn’t track human difficulty. Models learn from patterns in text and structured data. Tasks that are simple for humans but rarely appear in training data — or that require genuine world simulation rather than pattern matching — can be completely outside the model’s capability. Counting letters, spatial reasoning, and physical causality are common examples where models underperform despite excelling at harder-seeming tasks.

Does the jagged frontier get smoother as models improve?

In some areas, yes. For knowledge work — writing, coding, analysis — the frontier is becoming more uniform as models improve. But for tasks that require novel reasoning, genuine autonomy, or physical-world understanding, the frontier remains highly jagged. Progress is uneven, and new capability gaps emerge even as old ones close.

How does the jagged frontier affect which AI model I should use?

It means you shouldn’t use a single model for all tasks. The right model for drafting marketing copy may not be the right model for analyzing financial data. Running your own evaluations on your specific use cases — rather than relying on general leaderboards — gives you a more accurate picture of where each model’s frontier actually sits for your needs.

Why do benchmark scores overstate AI capability?

Because benchmarks attract optimization pressure. Labs train models on data similar to known benchmarks, inflating scores on those specific tests without improving underlying capability. Benchmarks designed to resist gaming — novel problems, decontaminated questions, interactive tasks — consistently reveal lower capability than self-reported scores suggest. This is a direct consequence of the jagged frontier: the frontier advances where there’s measurement, while real capability in unmeasured areas lags behind.

What’s the practical risk of the jagged frontier for AI agents?

In agentic workflows, the risk is error propagation. A model failure in one step can cascade through subsequent steps. Because models are confidently wrong in areas just outside their capability boundary, detecting these failures before they compound is harder. Designing agentic systems with the jagged frontier in mind means building in verification steps, routing tasks to capable models, and not assuming a model that handles one part of a workflow well will handle adjacent parts at all.


Key Takeaways

  • The jagged frontier describes the uneven boundary of AI capability — models excel at some hard tasks while failing at seemingly simple ones.
  • This unevenness comes from training data distribution, benchmark saturation, emergent capabilities at scale, and the mismatch between human and AI difficulty.
  • Benchmark scores regularly overstate real capability because benchmarks attract optimization pressure. Independent tests on novel tasks consistently reveal gaps.
  • For practical use, the jagged frontier means: test models on your specific tasks, route different jobs to different models, and verify high-stakes outputs rather than trusting confident-sounding answers.
  • In agentic workflows, the jagged frontier is especially consequential — failures compound across steps in ways that don’t happen in single-turn interactions.
  • The frontier is smoothing for knowledge work but remains highly irregular for spatial reasoning, genuine novelty, and autonomous task completion.

Understanding the jagged frontier doesn’t make AI less useful — it makes you more effective at using it. Pick the right model for the right task, build verification into high-stakes workflows, and don’t let impressive benchmark scores convince you a model can handle everything. The frontier is always jagged somewhere. Knowing where it jags for your use case is what separates effective AI deployment from frustrated expectations.

Want to build applications that route tasks intelligently across the jagged frontier? Try Remy at mindstudio.ai/remy.

Presented by MindStudio
