What Is the Jagged Frontier? Why AI Capabilities Are Smoothing Out for Knowledge Work
The jagged frontier model assumed AI was great at some tasks and terrible at others. Learn why that's changing and what it means for how you deploy AI agents.
The Research That Defined How We Think About AI Limitations
In 2023, a team of researchers from Harvard Business School partnered with Boston Consulting Group to run one of the most rigorous real-world tests of AI capability to date. They took 758 BCG consultants — some of the most analytically capable knowledge workers in the world — and divided them into groups. Some worked with GPT-4. Some didn’t. Then they measured the results across 18 different business tasks.
The headline findings were striking. Consultants using AI completed 12.2% more tasks on average, finished work 25.1% faster, and produced output rated 40% higher quality by evaluators. For certain tasks, the performance gap was even more pronounced.
But buried in that research was a finding that received less attention at the time, and that turned out to be more important for how we actually use AI at work: for tasks that fell outside of the AI’s capability zone, consultants who used AI actually performed worse than those who didn’t use it at all.
That asymmetry — AI dramatically boosting performance on some tasks while silently degrading it on others — is what the researchers called the jagged frontier.
Understanding the jagged frontier, where it came from, and why it’s changing is now one of the most practically useful things a business leader or knowledge worker can know. It shapes which AI tasks are safe to automate, which require human oversight, and how to think about the AI agents you’re deploying across your organization.
What the Jagged Frontier Actually Means
The term comes from a 2023 paper by Fabrizio Dell’Acqua, Edward McFowland III, Ethan Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, and colleagues, titled “Navigating the Jagged Technological Frontier.” It uses a spatial metaphor to describe something real and counterintuitive about AI capabilities.
Imagine mapping every possible work task on a flat plane. Now imagine drawing a boundary — a frontier — around all the tasks an AI system can do well. Inside the frontier, the AI performs impressively. Outside it, the AI struggles or fails.
Here’s the problem: that boundary isn’t a smooth circle or a clean curve. It’s jagged. Irregular. Unpredictable. A task that seems complex might sit well inside the frontier. A task that seems trivial might sit just outside it.
Why “Jagged” Is the Right Word
The jaggedness is the key insight. If AI capabilities formed a smooth gradient — where harder tasks were always harder for AI — you could develop a simple mental model and apply it reliably. You’d delegate the easy stuff to AI and keep the hard stuff for humans.
But that’s not how it works. The frontier juts inward and outward in ways that don’t map neatly onto human intuitions about task difficulty.
Some examples from the BCG study and related research:
- GPT-4 could produce sophisticated market analysis and ideation at a level that impressed professional evaluators.
- The same model would fail on tasks requiring precise causal reasoning or tasks embedded in real-world operational constraints.
- It excelled at synthesizing information across documents but struggled with tasks that required maintaining consistent logical constraints over multi-step reasoning.
The practical danger here is that the failure mode isn’t obvious. AI-generated output on tasks outside the frontier often looks good. It reads fluently. It has the right format. But it’s subtly wrong — and workers who trust it perform worse than workers who never used AI at all.
The Trust Trap
This is what the researchers flagged as the most serious risk: not that AI fails visibly, but that it fails in ways that look like success.
Knowledge workers who were given AI assistance on tasks outside the frontier tended to accept the output rather than question it. The fluency of the text, the confidence of the response, the apparent coherence of the reasoning — all of these create a cognitive bias toward acceptance.
The researchers called this “falling asleep at the wheel.” You hand the task to AI, the AI produces something plausible, and you move on without catching an error you would likely have caught had you done the task yourself.
This is why understanding the jagged frontier isn’t just an academic exercise. It has direct consequences for how much you should trust AI output, when you need human review, and how you structure AI-assisted workflows.
What Made the Frontier Jagged: The Technical Roots
To understand why AI capabilities were so uneven in 2022–2023, and why that’s changing, you need to understand what caused the jaggedness in the first place.
Training Data Distribution
Large language models learn from text. The distribution of that text has massive effects on what they can and can’t do.
Tasks that appear frequently in human-written text — summarizing arguments, drafting emails, explaining concepts, writing code in popular languages — are tasks the model has seen thousands of versions of during training. The model has implicitly learned the structure of good output for these tasks.
Tasks that are rare in text — precise numerical reasoning, tracking multi-step logical dependencies, tasks requiring knowledge of specific operational contexts — are tasks where the model has less signal. Its performance becomes less reliable.
This creates capability peaks around text-heavy, language-centric work, and valleys around tasks that require types of reasoning less naturally expressed in written form.
Reasoning vs. Pattern Matching
Early large language models were essentially very sophisticated pattern matchers. They were good at producing outputs that looked like what follows from what came before, based on patterns in training data.
This works well for a lot of tasks. Drafting a proposal, summarizing a meeting, writing marketing copy — these tasks involve producing output that follows recognizable patterns from existing good examples.
But it breaks down on tasks that require genuine reasoning: holding a chain of logic together, checking consistency across multiple constraints, catching when a stated premise contradicts an implicit assumption.
The original GPT-4 was notably capable but had an irregular relationship with multi-step reasoning. It could perform impressively on problems that looked hard but had pattern-heavy solutions, while stumbling on problems that looked simple but required careful sequential reasoning.
Context Window Constraints
In 2022 and early 2023, context window limitations were a significant source of jaggedness. Models could only process a certain amount of text at once — typically 4,000 to 8,000 tokens.
This created hard cutoffs on certain tasks. Analyzing a long contract, synthesizing a large research corpus, maintaining consistency across a long document — all of these fell outside the frontier not because of inherent model limitations, but because of architectural constraints.
Tasks that happened to fit within the context window could be handled well. Tasks that exceeded it couldn’t. The boundary was technical, not intuitive, which is part of what made the frontier feel so arbitrary.
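As a concrete illustration of working around that constraint, here is a minimal sketch of the document-chunking approach teams relied on at the time. The four-characters-per-token estimate is a rough heuristic for English text, not a real tokenizer, and the budget is an assumed window size:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real pipeline would use the target model's actual tokenizer.
    return len(text) // 4

def chunk_for_context(document: str, max_tokens: int = 8000) -> list[str]:
    """Split a document into paragraph-aligned chunks, each fitting
    within an assumed context-window budget."""
    chunks, current, current_tokens = [], [], 0
    for para in document.split("\n\n"):
        para_tokens = estimate_tokens(para)
        # Flush the current chunk if adding this paragraph would exceed budget.
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Every chunk then got its own model call, and the partial results had to be stitched back together by hand, which is exactly where long-document consistency broke down.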
Hallucination Patterns
Hallucination — the tendency of models to produce confident-sounding but false information — wasn’t uniform across task types. It was much more likely in certain domains than others.
Tasks requiring specific factual recall (dates, statistics, citations, technical specifications) had high hallucination risk. Tasks involving general reasoning or synthesis had lower risk.
Again, this created a frontier that didn’t map to human intuitions. A human expert might assume AI would be more reliable on tasks with clear, factual answers and less reliable on complex synthesis. In practice, it was often the reverse.
Where the Frontier Was and Wasn’t: Concrete Examples
Looking at where the jagged frontier sat in practice for knowledge work tasks helps illustrate why organizations needed to think carefully about which tasks they delegated to AI.
Tasks Inside the Frontier (circa 2022–2023)
These were the areas where early enterprise AI adoption showed real results:
Drafting and editing: Summarizing documents, drafting initial versions of emails, proposals, and reports, suggesting edits to existing writing. These tasks were strongly inside the frontier.
Ideation and brainstorming: Generating options, suggesting alternatives, producing lists of ideas. Models were good at this, often surprisingly good — the BCG study found AI-assisted ideation outperformed unassisted work substantially.
Code assistance: Writing code in common languages (Python, JavaScript, SQL), explaining what existing code does, suggesting fixes for common errors. This was inside the frontier, which drove massive early adoption among developers.
Summarization: Condensing long documents, extracting key points, producing structured summaries from transcripts or reports. Models handled this well when input fit in the context window.
Translation and reformatting: Converting content between formats, translating between languages, restructuring information. These were reliable AI tasks.
Tasks Outside the Frontier (circa 2022–2023)
These were areas where AI assistance degraded rather than improved outcomes:
Multi-step causal reasoning: Tasks that required building a chain of causes and effects and checking consistency throughout. Models would often drop a constraint somewhere in the chain.
Tasks with precise numerical requirements: Calculations, financial modeling, tasks where numbers needed to be exactly right. Models could make arithmetic errors that looked plausible.
Operational judgment with real-world constraints: Tasks like “given these specific constraints in our system, what should we do?” — where the answer required integrating real-world context the model didn’t have.
Tasks requiring consistency across a long document: Maintaining a consistent voice, checking for contradictions, ensuring arguments held together over many pages.
Novel domains with limited training data: Specialized technical fields, proprietary internal contexts, very recent events.
The Dangerous Middle Ground
The most practically important category was tasks that sat right at the frontier’s edge — tasks where AI performance was inconsistent, where it sometimes worked well and sometimes failed, but where you couldn’t easily tell which situation you were in.
This is where the “falling asleep at the wheel” problem was most acute. If AI always failed on a task, you’d learn quickly not to trust it. If AI always succeeded, you could rely on it. The dangerous cases were the ones where AI produced high-quality output 80% of the time and subtly wrong output 20% of the time — and you couldn’t distinguish which was which from the outside.
Why the Frontier Is Smoothing Out
Here’s where the story changes. The jagged frontier wasn’t a permanent feature of AI systems — it was a feature of AI systems at a particular point in time. And that point has passed.
The frontier is still there, but it’s smoothing. The peaks and valleys are becoming less extreme. Tasks that were reliably outside the frontier in 2022 have moved inside it. The practical implications are significant.
Reasoning Models Change the Picture
One of the most consequential shifts has been the development of reasoning-first models. OpenAI’s o1 and o3 series, Anthropic’s Claude 3.7 Sonnet with extended thinking, and Google’s Gemini 2.0 Flash Thinking represent a different approach to how models process problems.
Rather than immediately producing output, these models are trained to reason through problems step by step before answering. This internal reasoning process — sometimes called a “chain of thought” — catches logical errors, checks consistency, and handles multi-step problems more reliably.
The impact on the jagged frontier is real. Tasks that previously required careful human oversight — particularly anything involving multi-step reasoning, logical consistency, or complex inference — have moved closer to or inside the frontier for reasoning-capable models.
Performance on competitive mathematical benchmarks like AIME (American Invitational Mathematics Examination) illustrates the shift. Early GPT-4 solved a small fraction of AIME problems. Reasoning models now solve a significant majority. These are problems that require sustained, error-checked logical reasoning — exactly the type of task that sat outside the earlier frontier.
Context Windows Are No Longer the Bottleneck
Context window sizes have expanded dramatically. Gemini 1.5 Pro shipped with a 1 million token context window. Claude’s models support up to 200,000 tokens. GPT-4o and its successors handle much more than earlier versions.
This has effectively moved entire categories of tasks inside the frontier. Analyzing a long contract, maintaining consistency across a lengthy document, synthesizing a large research corpus — these are now tractable for AI in ways they simply weren’t in 2022.
The frontier hasn’t just shifted. In some areas, it’s jumped forward substantially.
Tool Use and External Verification
Modern AI systems don’t just generate text — they can call tools, run calculations, search for information, execute code, and verify their own outputs.
This matters for the jagged frontier because many tasks were outside the frontier not because of fundamental reasoning limitations but because models were working with no ground truth. They’d hallucinate a statistic because they had no way to look it up. They’d make an arithmetic error because they had no calculator.
Models with tool use can delegate the parts they’re bad at (exact calculation, current fact retrieval) to systems that handle those parts well (calculators, search engines, code interpreters). The result is that composite AI systems — models with tools — have a meaningfully smoother frontier than standalone text generation.
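A minimal sketch of that delegation pattern follows. The tool-call format here is hypothetical (real model APIs each define their own structured format); the point is that arithmetic goes to a deterministic evaluator rather than model inference:

```python
import ast
import operator

# Safe evaluator for basic arithmetic expressions (no eval()).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str):
    """Deterministic arithmetic tool: parses and evaluates safely."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

# Registry mapping tool names to implementations.
TOOLS = {"calculator": calculator}

def dispatch(tool_call: dict):
    """Route a model-emitted tool call to its implementation."""
    return TOOLS[tool_call["name"]](tool_call["argument"])
```

With this shape, the model's job narrows to deciding *when* to emit `{"name": "calculator", "argument": "6 * 7"}` rather than computing the answer itself.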
The research on tool-augmented language models consistently shows this: models with access to external tools outperform equivalent models without them on tasks requiring precise factual or computational accuracy.
Multimodal Capabilities Fill Previous Gaps
Earlier models were purely text-based, which meant tasks involving images, diagrams, charts, or other visual content were entirely outside the frontier. You couldn’t ask AI to analyze a chart, read a screenshot, or understand a diagram.
Multimodal models have changed this. Vision capabilities are now standard in frontier models, and they’re becoming reliable rather than experimental. This has moved a substantial category of knowledge work tasks — anything involving documents with visual elements, data visualizations, presentations, physical space analysis — into the accessible range.
Better Calibration Through Training
Model training has also improved in ways that affect the jagged frontier less directly but still meaningfully. Models are better calibrated: more likely to express uncertainty when uncertain, less likely to produce confident-sounding wrong answers.
This doesn’t eliminate hallucination, but it changes the nature of the risk. When a model says “I’m not confident about this” or “you should verify this claim,” it’s giving you a signal to apply more scrutiny. Better calibration means the model’s expressed confidence is more informative — the failure mode of confidently wrong output has decreased.
Agentic Systems with Self-Correction
One of the most significant developments is the shift from single-turn AI queries to agentic systems that operate over multiple steps.
An AI agent tasked with producing an analysis doesn’t just generate one response — it can plan, gather information, produce a draft, check its own work, identify gaps, and revise. This iterative self-correction process catches errors that would have slipped through in a single-pass generation.
The jagged frontier for a single-pass GPT-4 query is not the same as the jagged frontier for a well-designed agentic workflow using the same underlying model. The agent architecture effectively smooths the frontier by compensating for model weaknesses through process.
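A rough sketch of such a loop, with the model call abstracted as a parameter so any backend can plug in. The prompts and the stopping rule are illustrative, not any particular framework's API:

```python
def agent_run(task: str, call_model, max_revisions: int = 3) -> str:
    """Draft-critique-revise loop: generate, self-check, fix, repeat.
    `call_model` is any function mapping a prompt string to a response."""
    draft = call_model(f"Draft a response to: {task}")
    for _ in range(max_revisions):
        critique = call_model(
            f"List factual or logical problems in:\n{draft}\n"
            "Reply NONE if there are none."
        )
        if critique.strip() == "NONE":
            break  # self-check passed; stop revising
        draft = call_model(
            f"Revise to fix these problems:\n{critique}\n\nDraft:\n{draft}"
        )
    return draft
```

Even this simple structure changes the reliability profile: a single-pass error only survives if it also slips past the critique step.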
What Frontier Smoothing Means for Knowledge Work
The smoothing of the frontier has practical implications that organizations are only starting to reckon with. Some of the key shifts:
Advice That Was Right in 2023 May Be Wrong Now
Organizations that evaluated AI tools in 2022 or 2023 and decided certain tasks weren’t suitable may be working from outdated information. The frontier has moved.
This doesn’t mean you should abandon critical review. But it does mean that blanket policies like “don’t use AI for X” need to be revisited regularly. What sat outside the frontier eighteen months ago may be inside it now.
The pace of capability improvement has been fast enough that annual re-evaluation of which tasks are appropriate for AI assistance is probably a minimum. Quarterly is better for organizations where AI plays a significant role.
Human Oversight Requirements Are Shifting, Not Disappearing
As more tasks move inside the frontier, the nature of human oversight changes. You’re less focused on “catch the mistake that AI makes on tasks it can’t do” and more focused on “verify that the AI’s reasoning on tasks it can do is aligned with your actual goals.”
This is a subtler form of oversight. It requires humans who understand what good output looks like, not just humans who can spot obvious errors.
The goal is what researchers sometimes call “calibrated trust” — trust that’s proportional to actual reliability, updated as you learn more about how AI performs in your specific context.
Task Classification Becomes More Important
Even with a smoother frontier, not all tasks are inside it. And the tasks still outside the frontier may be consequential ones — high-stakes decisions with real-world consequences, tasks requiring specialized knowledge the model doesn’t have, tasks embedded in complex organizational context.
Getting task classification right — knowing which tasks AI can handle reliably, which need oversight, and which shouldn’t involve AI — becomes more important as you deploy AI more broadly. The classification itself needs to evolve as the frontier moves.
The Risk Profile Changes with Deployment at Scale
At the level of an individual worker using AI occasionally, the jagged frontier matters but the stakes are bounded. One flawed AI output causes one problem.
At the level of an organization deploying AI agents that process thousands of tasks autonomously, the stakes are different. If AI handles a task reliably 97% of the time, the remaining 3% of failures still happen constantly when you operate at scale.
Frontier smoothing matters at scale precisely because it changes the math. Moving a task from 80% reliable to 99% reliable is the difference between constant human intervention and rare exception handling.
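The arithmetic behind that claim is simple enough to sketch. Assuming an illustrative volume of 10,000 tasks per day:

```python
# Expected daily failures at different per-task reliability levels,
# for an agent handling an assumed 10,000 tasks per day.
tasks_per_day = 10_000

for reliability in (0.80, 0.97, 0.99):
    failures = tasks_per_day * (1 - reliability)
    print(f"{reliability:.0%} reliable -> {failures:.0f} failures/day")
```

At 80% reliability that is 2,000 exceptions a day, which means a standing review team; at 99% it is 100, which a lightweight escalation process can absorb.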
The “Surprisingly Good at Easy, Surprisingly Bad at Hard” Pattern Is Still Present
Even as the frontier smooths, some form of jaggedness persists. The pattern has shifted — what counts as “hard” for AI has changed — but the fundamental dynamic remains.
Current frontier models can write sophisticated legal memos but may still struggle with tasks that require precise operational context that isn’t in their training data. They can analyze complex financial models but may make undetected errors in highly specialized domains.
The lesson from the original jagged frontier research hasn’t expired. It’s been updated. You still need to understand where your AI system’s reliable zone ends, and you still need verification processes for tasks near that edge.
How Organizations Should Think About AI Deployment Now
The jagged frontier’s evolution changes how smart organizations approach AI deployment. Here’s what updated thinking looks like.
Map Your Tasks, Not Just Your Tools
Most AI adoption conversations start with tools: “Which AI platform should we use? Which model is best?” But the more important question is task mapping.
Before deploying AI for a category of work, map the specific tasks involved and assess where they sit relative to current AI capabilities. This isn’t a one-time exercise. It needs to be repeated as models improve.
Useful dimensions for task assessment:
- Verification difficulty: How easy is it to check the AI’s output? Is the correctness of the output obvious, or does checking require deep expertise?
- Error tolerance: What happens when the AI is wrong? Are errors recoverable, or are they high-consequence?
- Context requirements: Does the task require knowledge of your specific operational context, or does it work from general knowledge?
- Reasoning depth: Does the task require sustained logical inference, or is it more synthesis and generation?
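One way to operationalize these four dimensions is a simple scoring rubric. The scores and thresholds below are illustrative placeholders, not empirically derived values; real cutoffs should come from your own evaluation data:

```python
from dataclasses import dataclass

@dataclass
class TaskAssessment:
    # Each dimension scored 1 (low risk) to 5 (high risk).
    verification_difficulty: int  # how hard is it to check the output?
    error_consequence: int        # how bad is a wrong answer?
    context_dependence: int       # how much org-specific context is needed?
    reasoning_depth: int          # how much sustained inference is required?

    def recommended_mode(self) -> str:
        risk = (self.verification_difficulty + self.error_consequence
                + self.context_dependence + self.reasoning_depth)
        if risk <= 8:
            return "automate"
        if risk <= 14:
            return "automate with human review"
        return "human-led, AI-assisted"
```

Scoring a task like meeting summarization versus contract negotiation support makes the frontier question concrete instead of intuitive.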
Design for the Remaining Jaggedness
Even with a smoother frontier, your deployment architecture should account for the tasks that remain unreliable. Practically, this means:
Build in verification steps for high-consequence outputs. Don’t route AI output directly to external actions (sending emails, making decisions, publishing content) without a review step for tasks near the frontier edge.
Use confidence signals where models provide them. Modern models are reasonably good at expressing uncertainty. Where they flag low confidence, treat that as a trigger for human review.
Log and monitor outputs so you can identify patterns in AI failures. If you’re seeing errors in a particular task type, that’s data about where your frontier sits in practice.
Test systematically rather than relying on anecdotal experience. If you’re deploying AI for a high-volume task, run structured evaluations before full deployment to understand the actual reliability level.
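The confidence-signal idea above can be made concrete with a minimal routing sketch. Where the confidence score comes from (log probabilities, a verifier model, or the model's self-reported uncertainty) is an assumption left open here, and the threshold is a placeholder to be tuned against your own logs:

```python
def route_output(output: str, confidence: float,
                 review_threshold: float = 0.85) -> dict:
    """Route an AI output: auto-approve above the threshold,
    queue for human review below it."""
    if confidence >= review_threshold:
        return {"action": "auto_approve", "output": output}
    return {
        "action": "human_review",
        "output": output,
        "reason": f"confidence {confidence:.2f} below threshold",
    }
```

The routing decision itself is cheap; the value comes from the logged `reason` fields accumulating into a picture of where your frontier actually sits.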
Think in Terms of Human-AI Systems, Not AI Replacement
The BCG research is often cited for its headline finding that AI boosted performance. But the more enduring insight is about how performance was boosted — by changing how humans and AI work together.
The consultants who did best with AI weren’t the ones who maximally delegated to AI. They were the ones who understood what AI was good at, leveraged those capabilities, and applied their own judgment where AI was unreliable.
As the frontier smooths, the optimal human-AI collaboration pattern shifts. But there will likely always be a frontier edge where human judgment is the right check on AI output. Designing for that is more robust than designing for AI autonomy across the board.
Update Your Mental Models Regularly
One of the most consequential differences between 2022 and today is the pace of change. In a slower-moving technology environment, you could build a mental model of “what AI can do” and rely on it for years.
That’s no longer true. Mental models about AI capability that are more than six months old are probably outdated in significant ways. Organizations that build in regular AI capability reviews — and that treat “AI can’t do X” as a hypothesis rather than a fact — are better positioned to capture gains as the frontier moves.
Building AI Agents on a Smoother Frontier with MindStudio
The smoothing of the jagged frontier has direct implications for how you should architect AI agents for business use. When AI was more jagged, the safe approach was narrow agents — AI systems scoped to specific tasks well within the frontier, with humans handling everything outside it.
A smoother frontier opens up a different kind of architecture: broader agents that can handle more task types reliably, with targeted human-in-the-loop steps for the areas where reliability still matters.
This is exactly the territory MindStudio was built for. MindStudio is a no-code platform for building and deploying AI agents — not simple automation triggers, but agents that can reason and act across multiple steps. You can build agents that draft, analyze, verify, and route work, pulling in 200+ AI models depending on what a given task needs.
The multi-model approach matters here. Different models have different frontier shapes. Claude’s reasoning strengths differ from GPT-4o’s. Gemini 2.0 Flash handles certain workloads better than others. An agent that can route tasks to the right model — choosing a reasoning model for logic-heavy steps and a faster model for straightforward generation — is exploiting the best of each model’s frontier rather than accepting the limitations of any single one.
MindStudio’s 1,000+ integrations mean these agents can pull in the external tools that effectively smooth the frontier further. An agent that can search the web, run calculations, pull from a database, and verify its own output is significantly more capable than one operating purely on model inference. This is the tool-augmented approach that research consistently shows improves reliability on tasks near the frontier edge.
For teams actively thinking about where to apply AI agents — and where human oversight still belongs — MindStudio’s visual builder makes it practical to add review steps, branch logic based on confidence signals, and route exceptions for human handling without writing code.
You can try it free at mindstudio.ai.
Frequently Asked Questions
What is the jagged frontier in AI?
The jagged frontier is a concept from a 2023 Harvard Business School and Boston Consulting Group study. It describes how AI capabilities aren’t uniformly distributed across tasks — there’s a boundary separating tasks AI handles well from tasks it handles poorly, and that boundary is irregular and hard to predict from intuition alone. Tasks that seem complex might be inside the frontier; tasks that seem simple might fall outside it. The danger is that AI often fails silently on tasks outside the frontier, producing output that looks plausible but contains errors that degrade overall performance.
Who coined the term “jagged frontier”?
The term was coined by researchers at Harvard Business School, primarily Fabrizio Dell’Acqua and colleagues, in their 2023 paper “Navigating the Jagged Technological Frontier.” The research was conducted with Boston Consulting Group using 758 professional consultants as subjects, making it one of the most rigorous real-world studies of AI performance in knowledge work contexts.
Is the jagged frontier still relevant in 2024 and 2025?
Yes, but the frontier has shifted substantially. Reasoning models, longer context windows, tool use, and agentic architectures have moved many tasks that were outside the frontier in 2022–2023 to inside it. The frontier still exists — AI still has blind spots and unreliable zones — but the peaks and valleys are less extreme. The core lesson of the original research (understand where AI is reliable and where it isn’t, and design accordingly) remains valid even as the specific capabilities have changed.
How does the jagged frontier affect AI agents specifically?
AI agents compound the frontier issue because they operate autonomously across multiple steps. If a task is outside the frontier and produces a bad output, and that output feeds into the next step of an agent workflow, errors can propagate and amplify. This is why task assessment — knowing which steps in a workflow are inside the frontier — is especially important for agentic deployments. The good news is that agentic architectures with self-correction loops, tool access, and verification steps effectively smooth the frontier by compensating for single-step model weaknesses.
How do I know if a task is inside or outside the AI frontier for my use case?
There’s no universal answer because the frontier depends on the specific model, the specific task, and the specific context. Practical approaches include: structured evaluation before deployment (test AI output against known-correct answers for a sample of your tasks), checking whether AI expresses uncertainty on task outputs (modern models are reasonably well-calibrated), looking at error patterns in logged outputs over time, and consulting model benchmarks that test capabilities relevant to your task types. The key is treating frontier assessment as an ongoing process rather than a one-time determination.
Why did AI users sometimes perform worse than non-AI users in the BCG study?
When consultants used AI for tasks outside the frontier, they tended to accept AI output rather than scrutinize it — a dynamic the researchers called “falling asleep at the wheel.” The AI produced fluent, plausible-sounding output that didn’t obviously signal its incorrectness. Without AI, the same consultants would have completed the task themselves and caught errors in the process. With AI, they delegated and missed the errors. This “automation complacency” effect means that blind trust in AI output for tasks outside the frontier can actively worsen performance relative to not using AI at all.
What’s the difference between a jagged frontier and just saying “AI has limitations”?
Saying AI has limitations is true but not very useful — everything has limitations. The jagged frontier concept is more specific and more actionable: it says the limitations don’t follow a predictable pattern based on task difficulty as humans perceive it. This means you can’t rely on your intuition about which tasks are “too hard” or “simple enough” for AI. You need empirical assessment of specific task types, because AI’s capability profile cuts across human intuitions in non-obvious ways. The concept also highlights the asymmetric risk: tasks inside the frontier improve performance; tasks outside it can degrade it.
Key Takeaways
The jagged frontier model was a genuine insight when it emerged in 2023 — a research-backed framework for understanding why AI boosted performance on some tasks and hurt it on others. But it was always a snapshot of capabilities at a specific moment, not a permanent law of AI behavior.
The frontier is smoothing. Here’s what that means in practice:
- Reasoning models, expanded context windows, and tool-augmented AI have moved significant categories of tasks inside the frontier that were outside it as recently as two years ago. Organizations working from 2022–2023 AI evaluations are probably being more conservative than current capabilities require.
- The core lesson still applies: where AI is reliable, it improves performance substantially. Where it isn’t, it can degrade performance through over-trust. The locations have changed; the dynamic hasn’t.
- Agentic architectures effectively smooth the frontier by combining model reasoning with external tools, self-correction loops, and verification steps. A well-designed AI agent is more reliable than the underlying model on its own.
- Regular re-evaluation is now mandatory. Treating “AI can’t do X” as a permanent conclusion is a fast way to fall behind. What was true in 2023 may not be true in 2025.
- Deployment design matters as much as model selection. Human-in-the-loop steps, confidence-based routing, and output verification are the practical tools for managing the remaining jaggedness.
If you’re building AI agents for business use and want a platform designed for this kind of thoughtful deployment — one that gives you the flexibility to combine models, add verification steps, and control where AI operates autonomously and where humans stay in the loop — MindStudio is worth looking at. You can start building for free at mindstudio.ai.