Claude Opus 4.7 Review: What Actually Changed and What Got Worse
Opus 4.7 fixes agentic persistence and boosts coding benchmarks but regresses on web research and costs more due to a new tokenizer. Full breakdown.
The Short Version
Claude Opus 4.7 is a meaningful step forward for agentic coding workflows. It fixes a persistent issue with multi-step task completion, posts real gains on SWE-Bench and HumanEval, and introduces a redesigned tokenizer that improves multilingual handling. But the same tokenizer bumps token counts by roughly 12–18% on typical workloads, which means you’re paying more per task even if the model performs better. And web research quality — one area where Opus 4.6 was genuinely strong — has taken a measurable hit.
This is not a clean upgrade for every use case. Whether it’s worth switching depends entirely on what you’re using it for.
For a full overview of what Anthropic positioned this release as, see what Opus 4.7 is and why Anthropic built it this way. This review focuses on what actually changed in practice.
What’s New in Opus 4.7: The Core Changes
Anthropic made four categories of changes with this release:
- Agentic persistence improvements — better task continuation across long tool-use sequences
- Coding capability gains — higher scores on SWE-Bench, HumanEval, and internal agentic coding evals
- A new tokenizer — redesigned vocabulary with better multilingual coverage
- Vision updates — improved chart and document parsing accuracy
The vision improvements are real but incremental. If your use case involves extracting structured data from financial documents or parsing complex charts, the vision changes in Opus 4.7 are worth reading about separately.
The bigger story is the first three items — and how they interact in ways Anthropic didn’t fully advertise.
Agentic Persistence: The Fix That Actually Matters
The most significant improvement in Opus 4.7 is not a benchmark number. It’s a behavioral fix that anyone running multi-step agents has been waiting for.
Opus 4.6 had a known issue: in long agentic sequences — especially those involving repeated tool calls, filesystem operations, or web fetches — the model would sometimes abandon subtasks partway through. It would declare completion before actually completing the task, or drop a branch of reasoning without flagging the failure.
This wasn’t a hallucination problem in the traditional sense. The model knew what it was supposed to do. It just stopped doing it. Anthropic described this internally as a “persistence deficit” in long-horizon tasks.
Opus 4.7 fixes this in a substantial way. In internal testing on multi-step coding workflows, task abandonment rates dropped by roughly 60% compared to 4.6. The model is now significantly better at maintaining goal state across tool-call sequences, re-attempting failed steps, and distinguishing between “I can’t do this” and “this step failed and I should try a different approach.”
This matters a lot for agentic coding use cases where Opus is operating autonomously. If you’ve been working around the 4.6 persistence issues with custom prompting or manual checkpointing, those workarounds are mostly unnecessary now.
What Changed Under the Hood
The persistence improvement appears to come from two sources: additional training on long-horizon agentic trajectories, and a revised system prompt architecture that gives the model better access to its own task state. The latter means the model can more reliably track what it’s done, what it hasn’t done, and what failed — without that tracking eating into useful context.
This is different from just making the model more stubborn. A model that won’t give up is not always a good model — sometimes stopping and flagging is the right call. Opus 4.7 is better at the distinction: it abandons tasks less, but when it does abandon, the failure reporting is more useful.
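The retry-versus-abandon distinction is easy to see as plain control flow. The sketch below is purely illustrative (the error types, function name, and return shape are all hypothetical, not Anthropic's internal implementation):

```python
# A minimal sketch of the retry-vs-abandon distinction described above.
class RetryableStepError(Exception):
    """The step failed, but a different attempt might succeed."""

class FatalStepError(Exception):
    """The step cannot succeed; abandon it and report why."""

def run_step_with_persistence(step, max_attempts=3):
    """Re-attempt failed steps instead of silently dropping them."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return {"status": "done", "result": step()}
        except RetryableStepError as err:
            last_error = err  # keep trying while attempts remain
        except FatalStepError as err:
            # Abandoning is sometimes correct, but it must be flagged,
            # never reported as success.
            return {"status": "abandoned", "reason": str(err)}
    return {"status": "failed", "reason": str(last_error),
            "attempts": max_attempts}
```

The point is the three distinct outcomes: done, abandoned-with-reason, and failed-after-retries. The 4.6 failure mode was effectively collapsing the last two into the first.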
Coding Benchmarks: What the Numbers Actually Show
Opus 4.7 posts clear improvements on standard coding benchmarks:
- SWE-Bench Verified: up from 71.2% (4.6) to 78.9%
- HumanEval: up from 88.3% to 91.7%
- MBPP: modest improvement, from 87.1% to 89.4%
These are real gains, not margin-of-error noise. The SWE-Bench improvement in particular is meaningful — that benchmark is specifically designed for realistic software engineering tasks, not isolated function completions. The nearly 8-point jump on SWE-Bench reflects the same persistence improvements described above, combined with better multi-file reasoning.
For a full breakdown across coding, vision, and financial analysis tasks, the Opus 4.7 benchmark analysis covers each category in detail.
A Note on Benchmark Interpretation
These are Anthropic’s self-reported figures. Independent testing tends to show tighter margins than vendor numbers. If you care about accuracy here, benchmark gaming is a real issue with self-reported AI scores, and it’s worth treating any single vendor’s numbers as a ceiling rather than a ground truth.
The functional coding improvements are real — they show up consistently in developer testing outside of formal evals. But the magnitude may be closer to 5–6 points real-world versus 7–8 points on the published benchmarks.
Where Opus 4.7 Regressed: Web Research Quality
Here’s the part Anthropic didn’t lead with.
Opus 4.6 was exceptionally good at web research tasks — synthesizing information across multiple sources, identifying contradictions between sources, and producing coherent summaries with accurate attribution. This was one area where it pulled ahead of GPT-5 variants in independent testing.
Opus 4.7 is noticeably weaker here. The degradation shows up in a few specific ways:
- Source synthesis accuracy is down. The model more frequently attributes claims to the wrong source when pulling from multiple documents.
- Contradiction detection has declined. When two sources say conflicting things, 4.7 is more likely to blend them into a single “both are true” response rather than flagging the inconsistency.
- Citation specificity has dropped. Responses reference sources less precisely, with fewer direct quotes and more paraphrasing that drifts from the original.
This likely reflects a training tradeoff. The improvements to agentic persistence and long-horizon task completion required additional training data focused on tool-use and code trajectories — and that appears to have shifted the model away from some of the careful, cross-referential reasoning that made 4.6 strong at research tasks.
If your primary use case is web research or document analysis rather than coding, comparing 4.6 and 4.7 directly is worth doing before you commit to upgrading. For real-time search specifically, other models may now be more competitive.
The New Tokenizer: Better Coverage, Higher Costs
Opus 4.7 ships with a redesigned tokenizer. The main motivation was multilingual improvement — the previous tokenizer was significantly less efficient at handling non-Latin scripts, particularly Mandarin, Japanese, Korean, Arabic, and Hindi. For those languages, the new tokenizer reduces token counts by 20–35%, which is a meaningful cost reduction for non-English workloads.
For English, the math goes the other way.
The new tokenizer is slightly less efficient with English text. Typical English prompts and completions run 12–18% longer in token count compared to the Opus 4.6 tokenizer. On a single request, this is noise. Across thousands of API calls, it adds up fast.
If you’re running Opus 4.7 at scale on English-language tasks, expect your token costs to be meaningfully higher than with 4.6 — even if the per-token price stays the same. This is not always obvious in initial testing because small-volume tests don’t surface the cumulative effect.
Understanding how token-based pricing actually works is useful context here. The price per token hasn’t changed, but the token count per task has. That’s functionally equivalent to a price increase for English-dominant workloads.
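A quick back-of-the-envelope calculation makes the effect concrete. The daily volume and the per-million-token price below are placeholder numbers chosen for illustration, not actual pricing; only the 12–18% inflation range comes from the measurements above:

```python
def effective_cost(tokens_per_day, token_inflation, price_per_mtok):
    """Same per-token price, more tokens per task: daily cost rises."""
    old_cost = tokens_per_day / 1e6 * price_per_mtok
    new_cost = tokens_per_day * (1 + token_inflation) / 1e6 * price_per_mtok
    return old_cost, new_cost

# 2M tokens/day, 15% inflation (midpoint of 12-18%), placeholder $15/Mtok
old, new = effective_cost(2_000_000, 0.15, 15.0)
print(round(old, 2), round(new, 2))  # 30.0 34.5
```

At this (hypothetical) volume, the tokenizer change alone adds about $4.50 per day with no change to the listed price, which is exactly why small-volume testing doesn't surface it.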
There are ways to mitigate this. Multi-model routing — using Opus for complex reasoning steps and a lighter model for simpler tasks — becomes more important given 4.7’s token-efficiency profile. If you’re not already routing selectively, now is a good time to start.
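Selective routing doesn't require anything elaborate. Here's a sketch with hypothetical model identifiers and a deliberately crude task classifier (production routers typically score tasks rather than match fixed labels):

```python
# Hypothetical model identifiers; substitute whatever your stack uses.
HEAVY_MODEL = "opus-4.7"        # reasoning-intensive steps
LIGHT_MODEL = "lighter-model"   # formatting, boilerplate, simple edits

def pick_model(task_kind: str) -> str:
    """Send only reasoning-heavy work to the expensive model."""
    heavy = {"planning", "multi_file_refactor", "debugging"}
    return HEAVY_MODEL if task_kind in heavy else LIGHT_MODEL

print(pick_model("planning"))    # opus-4.7
print(pick_model("formatting"))  # lighter-model
```

Even a classifier this crude shifts most of the token volume (formatting, boilerplate, simple edits) off the model whose token counts just inflated.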
How Opus 4.7 Sits in the Broader Landscape
Against GPT-5.4
GPT-5.4 is stronger on web research and source synthesis — the exact areas where Opus 4.7 regressed. For coding, the gap has narrowed in Opus 4.7’s favor, but GPT-5.4 is still more consistent on refactoring large codebases with complex dependencies. Neither model is a clean winner. Your use case determines which one is actually better for you.
A direct benchmark comparison across Opus 4.7, GPT-5.4, and Gemini 3.1 Pro breaks this down across specific task categories if you want the full picture.
Against Claude Mythos
Anthropic’s most capable model, Claude Mythos, posts a 93.9% on SWE-Bench Verified — roughly 15 points above Opus 4.7. Mythos is demonstrably better at complex reasoning, long-context understanding, and agentic tasks. It’s also substantially more expensive.
The question of what Anthropic is holding back with Opus 4.7 vs. Mythos is worth reading if you’re deciding whether Mythos is justified for your workload, or if 4.7 covers your needs at a lower cost.
Against the Rest
For teams considering open-weight alternatives, Qwen 3.6 Plus has emerged as a credible competitor in agentic coding at significantly lower cost. It doesn’t match Opus 4.7 on complex multi-file tasks, but for more contained coding workflows, the gap is smaller than it was six months ago.
How This Affects Remy
Remy uses Claude Opus as its core agent for complex reasoning and spec compilation, with Sonnet handling specialist subtasks. The Opus 4.7 improvements to agentic persistence are directly relevant here — the same failure modes that affected 4.6 in long coding sessions also affected spec-to-code compilation for larger applications.
With 4.7, Remy handles longer, more complex specs with fewer mid-process failures. The model is better at tracking which parts of a spec have been compiled, which haven’t, and where it needs to revisit earlier decisions based on later constraints.
The tokenizer cost increase is managed through the same multi-model routing that Remy already uses — Opus handles the reasoning-intensive steps, lighter models handle formatting and boilerplate, so the real-world cost impact is lower than it would be for a single-model setup.
If you’re building full-stack applications from specs and want to see how this plays out in practice, try Remy at mindstudio.ai/remy.
Who Should Upgrade
Upgrade if:
- You’re running agentic coding workflows and have hit the 4.6 persistence issues
- Your workload is primarily English-language coding, not research or document synthesis
- You work in non-English languages — the tokenizer improvement is substantial for you
- You’re building multi-step tool-use pipelines that require reliable task completion
Don’t upgrade (yet) if:
- Web research, source synthesis, or document analysis is your primary use case
- You’re cost-sensitive on English-language API usage and don’t have routing in place
- Your current 4.6 setup is working well and you don’t hit the persistence issues
Check the migration guide first if:
- You have custom system prompts tuned for 4.6 behavior — some of the behavioral changes in 4.7 can interact unexpectedly with prompts that were written around 4.6 quirks. The migration guide from 4.6 to 4.7 covers the specific changes to watch for.
Frequently Asked Questions
Is Claude Opus 4.7 better than Opus 4.6?
It depends on the task. For agentic coding and multi-step tool-use, yes — 4.7 is meaningfully better. For web research, source synthesis, and document analysis, 4.6 is actually stronger. The new tokenizer also makes 4.7 more expensive per English-language task, which is a real consideration for high-volume API usage.
Why does Claude Opus 4.7 cost more if the per-token price is the same?
The new tokenizer in Opus 4.7 is less efficient with English text than the tokenizer in 4.6. The same prompt and completion will use 12–18% more tokens on average. Since you pay per token, the effective cost per task goes up even though the listed price per token hasn’t changed. Non-English workloads benefit from the opposite effect — the new tokenizer is significantly more efficient for non-Latin scripts.
Does Opus 4.7 fix the task abandonment bug from 4.6?
Yes, substantially. The model is roughly 60% less likely to drop subtasks in long agentic sequences compared to 4.6. It’s not a complete elimination of the behavior — edge cases still exist — but the improvement is large enough that most users running long-horizon agentic workflows will notice a real difference.
How does Opus 4.7 compare to Claude Mythos?
Claude Mythos is Anthropic’s most capable model and scores roughly 15 points higher than Opus 4.7 on SWE-Bench. Mythos is better across nearly every task category, but it costs significantly more. Opus 4.7 is the practical choice for most production workflows where Mythos-level capability isn’t strictly necessary.
What happened to web research quality in Opus 4.7?
Opus 4.7 regresses on web research tasks compared to 4.6. Source attribution accuracy dropped, contradiction detection is weaker, and citation specificity is lower. This likely reflects a training tradeoff — improving agentic persistence required training data that shifted the model away from the careful cross-referential reasoning that made 4.6 strong on research tasks. If research quality matters to you, 4.6 is still the better choice for that specific use case.
Should I wait for a future model instead of upgrading to 4.7?
If coding performance and agentic reliability are your priorities, 4.7 is a real improvement worth moving to now. If you’re hoping for a model that fixes both agentic coding and web research simultaneously, what Anthropic is working toward with Mythos suggests that’s coming — but not at Opus pricing anytime soon.
Key Takeaways
- Agentic persistence is the headline improvement — long-horizon task completion is significantly more reliable in 4.7
- Coding benchmarks improved meaningfully, with SWE-Bench up nearly 8 points from 4.6
- Web research quality regressed — this is a real tradeoff, not a minor footnote
- The new tokenizer costs you more on English workloads — budget for 12–18% higher token counts if you’re not routing selectively
- Multilingual users benefit — non-Latin scripts are 20–35% more token-efficient with the new tokenizer
- Upgrade if coding is your core use case; stay on 4.6 if research is
If you’re evaluating which model to use as the backbone for a production agentic workflow, the 2026 guide to agentic workflow models puts Opus 4.7 in context alongside the full competitive landscape. And if you’re building full-stack applications where the underlying model quality matters directly to your output, try Remy — it runs on the best available Opus model and handles the routing tradeoffs automatically.