Code with Claude 2026: 5 New Agent Features Anthropic Just Shipped
Dreaming, Outcomes, multi-agent orchestration, Claude Finance, and Add-ins — here's what each one does and who it's for.
Anthropic Just Shipped 5 New Agent Features at Code with Claude 2026
Anthropic held its Code with Claude developer event this week across San Francisco, London, and Tokyo — extended due to demand — and released no new models. That was a deliberate choice. The five features they shipped instead — Dreaming, Outcomes, multi-agent orchestration, Claude Finance with 10 pre-built agents, and Add-ins — are a more honest map of where the real competition in AI has moved.
If you’ve been watching the space, you already know the frontier model race has quieted relative to the harness race. Codex versus Claude Code is a more meaningful contest right now than GPT versus Opus. What Anthropic announced this week is their answer to the question: what does the scaffolding around a capable model actually need to do?
Here are the five features, what each one does concretely, and what they mean for builders.
Five Features, One Coherent Argument
The releases aren’t random. They cluster around three unsolved problems in production agent systems: memory degrades across sessions, output quality is hard to enforce without human review, and complex jobs require coordination between multiple agents. Anthropic shipped something for each of those problems, plus two more features aimed at enterprise deployment.
Dreaming: Scheduled Memory That Runs Between Sessions
The first feature is called Dreaming. Anthropic describes it as a scheduled process that reviews your agent sessions and memory stores, extracts patterns, and curates memories so your agents improve over time.
The mechanics: Dreaming runs between sessions, not during them. It surfaces patterns a single agent can’t see on its own — recurring mistakes, workflows that agents converge on, preferences shared across a team. It also restructures memory so it stays high-signal as it evolves. The goal is for agents to not just complete tasks but to report what they learned while doing them, encoding those learnings into orchestration memory that gets preloaded the next time that agent runs.
The practical implication is that an agent system running Dreaming should get measurably better the longer it operates, without any manual intervention from the person who built it. Memories persist between sessions and the curation process is automated.
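Anthropic hasn't detailed Dreaming's internals here, but the shape of the job is easy to picture. Below is a minimal sketch of a scheduled curation pass using the Anthropic Python SDK; the session-log layout, the MEMORY.md target, and the model choice are assumptions for illustration, not Dreaming's actual interface.

```python
# nightly_dream.py -- illustrative sketch only, not Dreaming's real interface.
# Reads raw session logs, asks Claude to distill recurring patterns, and
# rewrites a curated memory file that future sessions preload.
from pathlib import Path
import anthropic

SESSIONS_DIR = Path("agent_sessions")   # assumed layout: one .log file per session
MEMORY_FILE = Path("MEMORY.md")         # assumed curated-memory target

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def dream() -> None:
    logs = "\n\n---\n\n".join(p.read_text() for p in sorted(SESSIONS_DIR.glob("*.log")))
    current_memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""

    response = client.messages.create(
        model="claude-sonnet-4-5",  # any capable model works for curation
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "You are curating long-term memory for a coding agent.\n"
                "Existing memory:\n" + current_memory +
                "\n\nRecent session logs:\n" + logs +
                "\n\nRewrite the memory file: keep recurring mistakes, "
                "converged workflows, and shared team preferences; drop stale "
                "or one-off details. Return only the new memory file contents."
            ),
        }],
    )
    MEMORY_FILE.write_text(response.content[0].text)


if __name__ == "__main__":
    dream()  # run from cron or a scheduler between sessions, not during them
```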
One thing worth knowing: this is not a novel concept. The open-source Hermes agent framework has offered cross-session memory, skill-building from experience, and scheduled review for close to a year. Anthropic’s contribution here is making this a managed default rather than something you have to wire together yourself. For teams without the engineering bandwidth to build their own memory substrate, that matters. For teams already running Hermes or similar, the question is whether Anthropic’s managed version is worth migrating to.
If you’ve been building persistent memory manually — writing session summaries to files, maintaining a memory.md pointer index, running your own compaction logic — the Claude Code memory architecture patterns are worth revisiting now that Dreaming exists as a platform-level alternative.
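For contrast, the manual version is mostly bookkeeping: per-session summary files plus a small memory.md pointer index that an agent can load cheaply. A rough sketch of that pattern follows; the file layout is illustrative, not a prescribed convention.

```python
# Minimal sketch of the manual pattern: per-session summary files plus a
# memory.md pointer index. File names and layout are illustrative.
from datetime import date
from pathlib import Path

MEMORY_DIR = Path("memory")
INDEX = MEMORY_DIR / "memory.md"


def save_session_summary(topic: str, summary: str) -> None:
    MEMORY_DIR.mkdir(exist_ok=True)
    entry = MEMORY_DIR / f"{date.today()}-{topic}.md"
    entry.write_text(summary)
    # The index stays small: one pointer line per summary, so an agent can
    # load memory.md cheaply and follow pointers only when they're relevant.
    with INDEX.open("a") as f:
        f.write(f"- [{date.today()}] {topic}: see {entry.name}\n")
```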
Outcomes: A Separate Grading Agent Enforces Quality
The second feature, Outcomes, addresses a different problem: how do you know when an agent’s output is actually good enough to deliver?
The mechanism is clean. You write a rubric describing what success looks like for a particular task. When the agent completes the task, a separate grading agent scores the output against that rubric. The separation is the key design choice — the grading agent hasn’t seen the task agent’s reasoning, so it’s evaluating the output on its own terms rather than being anchored to the process that produced it. If the output doesn’t meet the quality threshold, the grading agent highlights the issues and kicks the task back for another run. Webhooks notify you when the task is complete.
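The managed version handles this loop for you, but the structure is simple enough to approximate yourself. Here is a minimal sketch of the rubric-plus-grader loop using the Anthropic Python SDK; the rubric text, score threshold, retry cap, and JSON format are illustrative choices rather than the Outcomes API, and the grader deliberately sees only the output and the rubric.

```python
# Illustrative rubric-and-grader loop; not the managed Outcomes API.
import json
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"

RUBRIC = """\
1. Every claim is supported by the provided source material.
2. The summary is under 300 words.
3. The tone is direct, with no filler.
"""


def run_task(task: str) -> str:
    resp = client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text


def grade(output: str) -> dict:
    # The grader sees only the output and the rubric, never the task agent's
    # reasoning, so it can't be anchored to the process that produced it.
    prompt = (
        f"Score this output against the rubric.\n\nRubric:\n{RUBRIC}\n\n"
        f"Output:\n{output}\n\n"
        'Reply with JSON only: {"score": <0-10>, "issues": ["..."]}'
    )
    resp = client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # Sketch assumption: the model returns bare JSON; production code would validate.
    return json.loads(resp.content[0].text)


def run_with_outcome(task: str, threshold: int = 8, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        output = run_task(task + feedback)
        verdict = grade(output)
        if verdict["score"] >= threshold:
            return output
        # Kick the task back with the grader's issues attached.
        feedback = "\n\nFix these issues:\n" + "\n".join(verdict["issues"])
    return output  # best effort after max_attempts
```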
The benchmark numbers Anthropic published are specific: using Outcomes improved file generation quality by 8.4% for Word documents and 10.1% for PowerPoint slides on their internal benchmarks. No model change. Just adding a grading loop.
That 10.1% improvement from a structural change rather than a model upgrade is the most interesting data point in the whole release. It suggests that a meaningful fraction of output quality problems aren’t model problems — they’re evaluation problems. You’re not getting bad outputs because the model is bad; you’re getting bad outputs because nothing is checking them.
Every’s Spiral writing agent is already using this in production. Every defined their own rubric based on editorial standards and writer voice, and the Outcomes feature enforces it automatically before delivery. That’s the right use case: high-volume output where human review is a bottleneck and quality standards are definable in advance.
The harder version of this problem — subjective rubrics for knowledge work outputs — is less solved. Coding rubrics are easy: does the PR pass tests? Writing rubrics require you to articulate what “good” means, which is genuinely difficult. But the architecture is sound, and the discipline of writing the rubric is itself valuable.
Multi-Agent Orchestration: Lead Agent, Specialist Sub-Agents, Shared File System
The third feature is multi-agent orchestration at the managed agents level. A lead agent breaks a job into pieces and delegates each piece to a specialist sub-agent with its own model, prompts, and tools. The sub-agents work in parallel on a shared file system, feeding results back into the lead agent’s context. The lead agent can check in on sub-agents mid-workflow to verify they’re on track.
The whole system is auditable in Claude Console. You can see what each sub-agent did, in what order, and inspect the reasoning behind task execution decisions.
The example Anthropic gave: a lead agent runs an investigation while sub-agents fan out through deploy history, error logs, metrics, and support tickets simultaneously. That’s a real pattern for engineering incident response, and it’s the kind of workflow that previously required custom orchestration code to build.
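To see the shape of the pattern without the managed layer, here is a sketch of that incident-investigation fan-out: a lead routine delegates to sub-agents that run in parallel and share a workspace directory. The sub-agent roles, file layout, and model choice are assumptions for illustration; this is not the managed orchestration API.

```python
# Illustrative fan-out/fan-in: a lead agent delegates to parallel sub-agents
# that share a workspace directory. Not the managed orchestration API.
import asyncio
from pathlib import Path
import anthropic

client = anthropic.AsyncAnthropic()
MODEL = "claude-sonnet-4-5"
WORKSPACE = Path("incident_workspace")  # shared file system for all agents

SUBTASKS = {
    "deploy_history": "Review recent deploys and flag anything correlated with the incident window.",
    "error_logs": "Summarize error-log anomalies during the incident window.",
    "metrics": "Identify metric regressions around the incident start time.",
    "support_tickets": "Cluster related support tickets and extract user-visible symptoms.",
}


async def run_subagent(name: str, instructions: str) -> str:
    resp = await client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": instructions}],
    )
    report = resp.content[0].text
    # Each sub-agent writes its findings to the shared workspace so the lead
    # agent (and any other sub-agent) can read them.
    (WORKSPACE / f"{name}.md").write_text(report)
    return report


async def investigate(incident: str) -> str:
    WORKSPACE.mkdir(exist_ok=True)
    # Fan out: all sub-agents run in parallel on their slice of the job.
    reports = await asyncio.gather(*[
        run_subagent(name, f"Incident: {incident}\n\nTask: {task}")
        for name, task in SUBTASKS.items()
    ])
    # Fan in: the lead agent synthesizes the sub-agent reports.
    resp = await client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content":
            "Synthesize these investigation reports into a root-cause hypothesis:\n\n"
            + "\n\n".join(reports)}],
    )
    return resp.content[0].text


if __name__ == "__main__":
    print(asyncio.run(investigate("Checkout latency spike starting 14:05 UTC")))
```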
For builders who’ve been constructing multi-agent systems manually — routing between models, managing shared state, handling failures — this is Anthropic absorbing a layer of infrastructure that used to be your problem. The multi-agent workflow patterns for Claude Code post covers the underlying patterns in more depth if you want to understand what the orchestration layer is actually doing.
If you want to build something like this without writing the orchestration code yourself, platforms like MindStudio handle this kind of composition: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows across a multi-agent setup.
Claude Finance: 10 Pre-Built Agents for Financial Services
The day before the main event, Anthropic shipped Claude Finance — a package of 10 predefined agents targeting financial services workflows. The agents include a pitch builder, meeting preparer, market researcher, evaluation reviewer, and month-end closer, among others.
Each agent can be deployed in three ways: as a plugin in Cowork, as a plugin in Claude Code, or as a managed agent. Anthropic also released a full cookbook so teams can understand how each agent works and modify it as needed.
The positioning is “starter pack.” Rather than requiring a financial services firm to build custom agents from scratch, Anthropic is giving them working implementations they can fork and adapt. The cookbook is the right move here — it lowers the barrier to customization without hiding how the system works.
The new connectors released alongside Claude Finance are worth noting separately: Dun & Bradstreet for business identity verification, Fiscal AI for market analysis, and Verisk for insurance underwriting. These are industry-specific data sources that make the agents actually useful for production financial workflows rather than demos.
The commentary that Anthropic “killed another wave of AI startups” with this release is mostly wrong. These agents are going after low-skill repetitive knowledge work — the kind of thing that was already semi-automated through traditional software or outsourced to junior staff. They’re not attacking high-skill financial analysis. The pitch builder isn’t replacing a senior banker; it’s replacing the analyst who spends three hours formatting a deck.
Add-ins: Claude Works Inside the Software, Not Alongside It
The fifth feature is Add-ins, which allows Claude to work directly within productivity software rather than accessing it through MCP or a connector.
The concrete example: instead of Claude accessing Microsoft Word via an external connection, Claude works inside Word. That means it has access to software-native context — your company’s document templates, linked spreadsheets for financial models, existing formatting conventions. The difference between “Claude can read your Word file” and “Claude is working in Word the way a human would” is meaningful for document-heavy workflows.
This is the feature that’s hardest to evaluate from the outside. The value depends entirely on how well the software-native context actually improves output quality. But the design principle is right: agents that understand the environment they’re operating in produce better results than agents that treat every task as a blank-slate text generation problem.
For builders thinking about how this connects to the broader question of going from a spec to a deployed application, tools like Remy take a related approach: you write an annotated markdown spec as the source of truth, and it compiles into a complete TypeScript stack — backend, database, auth, deployment. The spec carries the intent; the generated code is derived output. Different layer, same underlying idea that the context you bring to generation matters more than most people account for.
What’s Buried in the Announcement
The most underreported part of the Code with Claude event wasn’t any of the five features. It was Boris Cherny, the creator of Claude Code, saying in a panel discussion that there is literally no manually written code anywhere in Anthropic anymore. Claudes coordinate with each other over Slack, code in loops, and resolve issues across the codebase.
That’s not a marketing claim. That’s the creator of the tool describing how the company that built it actually uses it. Cherny also pushed back on the term “vibe coding”; he thinks it undersells what’s actually happening. Andrej Karpathy has suggested “agentic engineering” as a replacement, but Cherny isn’t sold on that either.
The framing matters for how you think about these five features. Dreaming, Outcomes, and multi-agent orchestration aren’t features for people who want to dabble with AI. They’re infrastructure for teams running agents at the scale where memory management, quality enforcement, and parallel execution are real operational problems. Anthropic is building for the version of software development that Cherny is describing, where agents coordinate with each other and the human’s job is to define the goal and review the output, not to write the code.
The roadmap items Anthropic teased reinforce this: higher judgment and code taste, context windows that “feel infinite” (the precise phrasing from Diane Penn, head of product for research), and improved multi-agent coordination. The infinite-context framing is careful: Penn didn’t say infinite, she said feel infinite. That’s probably compaction getting good enough that the seams disappear, not a fundamental research breakthrough. But the functional outcome is the same: agents that maintain coherent context across very long-running tasks.
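If “feels infinite” really is compaction, the core move is straightforward: once the transcript gets long, fold the oldest turns into a summary and keep only the recent turns verbatim. A minimal sketch, with the keep-recent threshold and summary prompt as assumptions:

```python
# Minimal compaction sketch: fold old turns into a running summary so the
# live context stays bounded. Thresholds and prompt wording are illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"


def compact(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Replace everything except the last `keep_recent` turns with a summary."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    resp = client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content":
            "Summarize this conversation so an agent can continue it without "
            "losing decisions, constraints, or open questions:\n\n" + transcript}],
    )
    summary = {"role": "user",
               "content": "[Summary of earlier conversation]\n" + resp.content[0].text}
    return [summary] + recent
```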
What to Do With This, This Week
If you’re building with Claude Code today, the most immediately actionable thing is to look at your existing multi-agent workflows and ask whether you’re doing memory management and quality review manually that Dreaming and Outcomes could handle for you. If you are, the managed agents platform is worth testing — not because it’s necessarily better than what you’ve built, but because maintaining that infrastructure yourself has a cost.
For the parallel branch workflows pattern specifically, the new orchestration layer is a direct upgrade path: sub-agents working in parallel on a shared file system is exactly the architecture that makes parallel feature development tractable at scale.
If you’re in financial services and have been putting off building agent workflows because the setup cost was too high, the Claude Finance cookbook is worth reading this week. The 10 pre-built agents give you a working reference implementation even if you end up building something custom.
The one opinion I’ll offer: the Outcomes feature is the most underrated thing in this release. A 10.1% improvement in output quality from adding a grading loop — with no model change — is a strong signal that most teams are leaving quality on the table not because their models are bad but because they have no systematic way to check outputs before delivery. Writing a rubric is hard work, but it’s the kind of hard work that compounds. Every/Spiral figured this out for writing quality. The same pattern applies to any high-volume output workflow where you can define what “good” looks like in advance.
The self-improving skills pattern with AutoResearch is worth reading alongside Dreaming — the combination of scheduled memory review and automated quality improvement is essentially what Anthropic is now offering as a managed service.
Anthropic released no new models at Code with Claude 2026, and that was the right call. The bottleneck for most production agent systems right now isn’t model capability; it’s the infrastructure around the model. These five features are Anthropic’s answer to that bottleneck.