Anthropic Dev Day: 6 New Managed Agent Features That Change How Claude Handles Long-Running Work
Dreaming, Outcomes, multi-agent orchestration, Claude Finance — Anthropic's dev day introduced the most complete managed-agent stack yet.
Anthropic Just Shipped 6 Managed Agent Features. Here’s What Each One Actually Does.
Anthropic held its Code with Claude developer day, and there was no new model. No Opus 4.8 teaser, no Mythos rollout, no benchmark drop to argue about on Twitter. What you got instead were six concrete additions to the managed agents platform — Dreaming, Outcomes, multi-agent orchestration, Claude Finance, new connectors, and add-ins for Microsoft apps — that together sketch out what Anthropic thinks the real competition in 2026 is actually about.
That competition isn’t model vs. model anymore. It’s harness vs. harness. And if you’re building anything that runs Claude for longer than a single prompt, these features are worth understanding in detail.
Here’s what shipped, what each thing does, and why the specifics matter.
The Agent That Forgets Everything Is a Broken Agent
The first feature, and the one that got the most attention from builders, is called Dreaming.
Anthropic describes it as “a scheduled process that reviews your agent sessions and memory stores, extracts patterns, and curates memories so your agents improve over time.” The key word is scheduled. This isn’t memory that accumulates passively in a context window. It’s a background job that runs between sessions, looks across multiple runs, and restructures what the agent knows so it stays high-signal as it evolves.
The practical problem Dreaming solves is one anyone who’s built a multi-session agent has hit: the agent does the same dumb thing twice. It makes a mistake in session one, you correct it, and then in session three it makes the same mistake again because nothing from session one persisted in a useful form. Dreaming is meant to close that loop — surfacing recurring mistakes, workflows the agent converges on, and preferences shared across a team.
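Anthropic hasn't spelled out Dreaming's internals here, but the loop it closes is easy to picture. Below is a minimal sketch of a scheduled memory-review job in the same spirit, built on the plain Messages API rather than the managed feature. The file layout, prompt, and model ID are illustrative assumptions, not Anthropic's implementation.

```python
# Sketch of a scheduled memory-review job in the spirit of Dreaming.
# File layout, prompt, and model ID are illustrative assumptions,
# not Anthropic's managed implementation.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SESSIONS_DIR = Path("sessions")   # one transcript file per agent session (assumed layout)
MEMORY_FILE = Path("MEMORY.md")   # curated memory the agent loads at startup (assumed)

def review_sessions() -> None:
    # Look across the most recent runs, not just the last one.
    transcripts = "\n\n---\n\n".join(
        p.read_text() for p in sorted(SESSIONS_DIR.glob("*.txt"))[-20:]
    )
    current_memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                "Here is the agent's current curated memory:\n"
                f"{current_memory}\n\n"
                "Here are its most recent session transcripts:\n"
                f"{transcripts}\n\n"
                "Rewrite the curated memory. Keep it short and high-signal: "
                "recurring mistakes and how to avoid them, workflows the agent "
                "converges on, and stable user preferences. Drop stale entries."
            ),
        }],
    )
    MEMORY_FILE.write_text(response.content[0].text)

if __name__ == "__main__":
    review_sessions()  # run from cron or a scheduler between sessions
```

The point of the sketch is the shape: a background pass that reads across sessions and rewrites the memory store, rather than memory that only accumulates inside a context window.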
Yan Cronberg put it plainly: “Agents that learn from past sessions and iterate until they hit quality enough is the architecture most teams have been trying to build manually. Dreaming seems to be the missing piece to that puzzle.”
The honest caveat here is that this isn’t new territory. Jeten Gar, who tracks the open-source agent ecosystem closely, noted that “the open-source agent ecosystem is leading on primitives… The closed labs have raw model capability. The open source ecosystem has agent primitives.” Hermes, the open-source agent framework, has had scheduled memory review, cross-session persistence, and skill extraction from experience for months. What Anthropic is doing is making that architecture the default — you don’t have to wire it up yourself, it’s just there when you spin up a managed agent.
That’s not a knock. Default availability changes adoption curves dramatically. But builders coming from OpenClaw or Hermes should know they’re not seeing a research breakthrough here. They’re seeing a productization of something the open-source side already proved out.
If you’ve been building your own memory layer for Claude (the kind of self-evolving memory system using Obsidian and Claude Code hooks that some builders have assembled manually), Dreaming is Anthropic’s answer to that problem at the infrastructure level. For teams who want to go further and compile that memory architecture into a deployable app, Remy is worth knowing about. It’s MindStudio’s spec-driven full-stack app compiler: you write a markdown spec with annotations and it compiles into a complete TypeScript app, including backend, database, auth, and deployment. That makes it a natural fit for teams who want to operationalize agent memory patterns without hand-wiring the infrastructure.
The Grading Agent That Checks the Work Before You See It
The second feature is Outcomes, and it’s the one with the most interesting benchmark attached to it.
Here’s how it works: you write a rubric describing what success looks like for a given task. When the agent completes the task, a separate grading agent scores the output against that rubric. If the output doesn’t pass, the grading agent flags the issues and kicks the task back for another run. Anthropic also added webhooks so you get notified when the task is actually done — not just when the agent thinks it’s done.
The separation between the task agent and the grading agent is the key design decision. The grading agent doesn’t see the reasoning chain that produced the output. It only sees the output itself, scored against the rubric. That removes a failure mode where the grading agent gets anchored to the task agent’s logic and approves something it shouldn’t.
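Here’s a minimal sketch of that pattern built on the plain Messages API rather than the managed feature: a task agent produces output, a separate grader scores only that output against the rubric, and failures go back for another attempt. The rubric text, model IDs, and retry limit are placeholder assumptions, not Anthropic’s Outcomes implementation.

```python
# Sketch of the task-agent / grading-agent loop that Outcomes describes.
# This is the pattern, not Anthropic's Outcomes API; rubric, model IDs,
# and retry limit are illustrative.
import json
import anthropic

client = anthropic.Anthropic()

RUBRIC = """\
- Opens with the key finding, not background
- Every claim is backed by a number from the source data
- No section longer than 150 words
"""

def run_task(task: str, feedback: str = "") -> str:
    prompt = task if not feedback else f"{task}\n\nFix these issues from review:\n{feedback}"
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def grade(output: str) -> dict:
    # The grader sees only the output and the rubric, never the task agent's
    # reasoning chain, so it can't be anchored to that reasoning.
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": (
                f"Rubric:\n{RUBRIC}\n\nOutput to grade:\n{output}\n\n"
                'Reply with JSON only: {"pass": true or false, "issues": ["..."]}'
            ),
        }],
    )
    return json.loads(msg.content[0].text)

def run_with_outcome(task: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        output = run_task(task, feedback)
        verdict = grade(output)
        if verdict["pass"]:
            return output  # in the managed version, this is where the webhook fires
        feedback = "\n".join(verdict["issues"])
    return output  # best effort after max_attempts
```

The design decision the sketch preserves is the blindfold: `grade` never sees the prompt history that produced the output, only the output and the rubric.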
The numbers Anthropic published: using Outcomes improved file generation quality by 8.4% for Word documents and 10.1% for PowerPoint slides on their internal benchmarks. Those aren’t enormous numbers, but they’re meaningful for a feature that requires zero additional model capability — just a second pass with a rubric.
What’s significant about this is the domain. External grading agents have been standard practice in coding workflows for a while. A PR either passes the unit tests or it doesn’t. The rubric is objective. Applying the same pattern to non-code knowledge work — documents, presentations, reports — is less developed territory. Coding rubrics are well-defined. “Does this PowerPoint meet our editorial standards?” is not. The fact that Anthropic is seeing measurable quality gains on subjective output is the interesting signal here.
Every’s Spiral writing agent is the clearest real-world example. Spiral is a tool built specifically to make AI writing not sound like AI writing, which is a genuinely hard problem. Every uses a multi-agent system across several Anthropic models for cost optimization, and now they’ve plugged in Outcomes with an editorial rubric based on their own writing standards and voice guidelines. The rubric enforces quality before the draft ever reaches a human editor. That’s the whole product for them — if the rubric doesn’t hold, Spiral doesn’t work.
One Agent That Runs the Room, Others That Do the Work
The third feature is multi-agent orchestration, which is the most architecturally significant addition even if it’s the least flashy to describe.
The setup: a lead agent breaks a job into pieces and delegates each piece to a specialist sub-agent with its own model, prompts, and tools. The sub-agents work in parallel on a shared file system. Their outputs feed back into the lead agent’s context. The lead agent can check in on sub-agents mid-workflow to make sure they’re still on track.
Anthropic’s example is a debugging investigation: the lead agent runs the investigation while sub-agents fan out through deploy history, error logs, metrics, and support tickets simultaneously. The whole thing is trackable in Claude Console — you can see what each sub-agent did, in what order, and inspect the reasoning behind each step.
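To make the shape concrete, here’s a stripped-down sketch of that fan-out built directly on the Messages API with asyncio, rather than the managed feature. The sub-agent roles, prompts, and model IDs are placeholders, and the in-memory results list stands in for the shared file system the managed version provides.

```python
# Sketch of the lead-agent / sub-agent fan-out pattern, built directly
# on the Messages API. Roles, prompts, and model IDs are illustrative.
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

SUB_AGENTS = {
    "deploy_history": "Review the deploy history below and flag suspicious changes.",
    "error_logs": "Review the error logs below and summarize the failure signature.",
    "metrics": "Review the metrics below and identify when the regression started.",
    "support_tickets": "Review the tickets below and summarize user-visible impact.",
}

async def run_sub_agent(role: str, instructions: str, data: str) -> tuple[str, str]:
    msg = await client.messages.create(
        model="claude-sonnet-4-5",  # placeholder
        max_tokens=1500,
        messages=[{"role": "user", "content": f"{instructions}\n\n{data}"}],
    )
    return role, msg.content[0].text

async def investigate(evidence: dict[str, str]) -> str:
    # evidence maps each role to its raw data (logs, metrics exports, etc.).
    # Fan out: each specialist works its own slice in parallel.
    findings = await asyncio.gather(
        *(run_sub_agent(role, prompt, evidence[role]) for role, prompt in SUB_AGENTS.items())
    )
    # Fan in: the lead agent synthesizes the specialists' reports.
    report = "\n\n".join(f"## {role}\n{text}" for role, text in findings)
    msg = await client.messages.create(
        model="claude-opus-4-1",  # placeholder
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                f"Specialist reports from a debugging investigation:\n\n{report}\n\n"
                "Identify the most likely root cause and the next step."
            ),
        }],
    )
    return msg.content[0].text
```

In the managed version, the shared file system and mid-run check-ins replace the in-memory results list, and the Claude Console trace replaces whatever logging you'd otherwise bolt on yourself.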
This is the architecture that makes long-horizon tasks tractable. A single agent with a single context window hits limits fast on complex jobs. A lead agent that can delegate, parallelize, and synthesize doesn’t have the same ceiling. If you’ve been building this kind of structure manually — the multi-agent team patterns using Paperclip and Claude Code that require careful orchestration setup — managed agents now handles the scaffolding.
The auditability piece matters more than it might seem. One of the real blockers for enterprise adoption of agents isn’t capability, it’s accountability. When something goes wrong, someone needs to be able to explain what happened. The ability to trace every sub-agent action in Claude Console gives teams a paper trail. That’s not a technical feature — it’s a compliance feature.
MindStudio handles this kind of orchestration across 200+ models with 1,000+ pre-built integrations and a visual builder for chaining agents — which is useful context for understanding what “managed” means at different levels of the stack. Anthropic is building the infrastructure layer; what you compose on top of it is still your problem.
A Starter Pack for Financial Services
Before dev day even started, Anthropic shipped Claude Finance — a suite of 10 predefined agents aimed at financial services firms.
The agents include a pitch builder, meeting preparer, market researcher, evaluation reviewer, and month-end closer, among others. They can be deployed as plugins for Co-work or Claude Code, or run as managed agents. Anthropic also released a full cookbook so teams can understand how each agent works and modify it.
The framing here is “starter pack.” Financial services firms have been building custom agents for these workflows anyway. What Anthropic is offering is a baseline that’s already been designed for the domain, with the option to customize rather than build from scratch. That’s a different value proposition than selling raw API access.
Alongside Claude Finance, Anthropic added three new connectors for industry-specific data: Dun & Bradstreet for business identity, Fiscal AI for market analysis, and Verisk for insurance underwriting. These aren’t generic connectors — they’re targeted at the specific data sources that financial services workflows actually depend on. A market research agent that can’t pull from a real market data provider isn’t useful. These connectors close that gap.
The commentary that these agents are “killing AI startups” misreads what’s happening. These agents are going after low-skill, repetitive knowledge work — the kind of thing that was already semi-automated through traditional software or outsourced to junior staff. They’re not touching high-skill financial analysis. The pitch builder isn’t replacing the banker who structures the deal. It’s replacing the analyst who spends three hours formatting the deck.
Claude Inside Your Apps, Not Just Beside Them
The fifth and sixth additions are quieter but worth flagging: add-ins for Microsoft productivity apps, and the infrastructure changes that were already in place from the April managed agents launch.
The add-ins feature means Claude can work directly inside Word, PowerPoint, Excel, and Outlook — not through an MCP connector or a separate window, but natively within the application. The practical difference is context. When Claude is working inside Word, it has access to your company’s document templates. When it’s inside Excel, it can see the linked spreadsheets. That software-native context is what makes the difference between an agent that produces generic output and one that produces output that fits your actual workflow.
The cross-app memory piece is also live: as Claude moves between tasks across Microsoft apps, it keeps the full context of the conversation. You can draft an email in Outlook, jump to Word to write a document that references that email, and Claude carries the thread.
The April managed agents launch — which added sandbox environments, state management, error recovery, and cloud computer access — is the foundation all of this runs on. Instead of giving your agent a local machine to work from, you spin up a cloud instance on Anthropic’s infrastructure. That’s what makes the multi-agent orchestration and Dreaming features possible at scale.
What the Research Head Hinted At
One more thing from dev day that didn’t make the feature list but belongs in any honest account of the event: Anthropic’s research head of product, Diane Penn, gave a brief look at what’s coming in future models.
Three things: higher judgment and code taste, multi-agent coordination improvements, and — the one that got the most attention — “infinite context windows that feel infinite.” Penn’s precise phrasing was deliberate. The word infinite was in quotation marks. The framing was context windows that feel infinite, not a claim about literal unbounded context.
The speculation that followed was mostly about whether this is sophisticated compaction — the process by which a harness compresses a filling context window and preserves the important parts — or something more fundamental. Penn didn’t say. But the direction is clear: Anthropic is working on agents that don’t lose the thread on long-running work, which is the same problem Dreaming is attacking from the memory side.
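For reference, the naive version of compaction is easy to sketch: once the message history grows past a threshold, summarize the older turns and keep the recent ones verbatim. The threshold, prompt, and model ID below are placeholders, and whatever Anthropic is actually building is presumably more sophisticated than this.

```python
# Naive sketch of harness-side compaction: summarize older turns, keep
# recent ones verbatim. Thresholds, prompt, and model ID are illustrative.
import anthropic

client = anthropic.Anthropic()

def compact(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    # Assumes each message's "content" is a plain string.
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = client.messages.create(
        model="claude-haiku-4-5",  # placeholder
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": (
                "Summarize this conversation, preserving decisions, "
                "constraints, and open questions:\n\n"
                + "\n".join(f'{m["role"]}: {m["content"]}' for m in older)
            ),
        }],
    ).content[0].text
    return [{"role": "user", "content": f"[Compacted history]\n{summary}"}] + recent
```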
Boris Cherny, the creator of Claude Code, added a data point that’s hard to dismiss: “There is literally no manually written code anywhere in the company anymore.” Anthropic’s own internal workflows now run on agents coordinating over Slack, coding in loops, and resolving issues across the codebase. That’s not a marketing claim — it’s a description of how the company that built these tools actually uses them. If you’re thinking about what agentic workflow patterns for Claude Code look like at scale, that’s the reference point.
Cherny also publicly disavowed the term “vibe coding” at the event, saying it undersells what’s actually happening. Andrej Karpathy has suggested “agentic engineering” as a replacement. Cherny isn’t sold on that either and is taking suggestions. The naming debate is trivial, but the underlying point isn’t: the way developers use AI has changed enough that the old vocabulary doesn’t fit anymore.
What This Adds Up To
The absence of a model release at Code with Claude was the most telling thing about the event.
Anthropic has a strong model. The competition for the next phase isn’t about who has the slightly better benchmark score — it’s about who has built the infrastructure that makes agents actually work in production. Memory that persists and improves. Quality review that doesn’t require a human in the loop. Orchestration that scales across parallel workstreams. Domain-specific agent suites that give enterprises a starting point.
Dreaming, Outcomes, multi-agent orchestration, and Claude Finance are Anthropic’s answer to the question of what comes after “the model is good enough.” The answer, apparently, is: build the harness.
The open-source ecosystem got there first on some of these primitives — Hermes had scheduled memory review before Anthropic shipped Dreaming as a managed feature. But there’s a difference between a primitive that a skilled builder can wire up and a default that any developer gets when they spin up a managed agent. Anthropic is betting that the second one is what enterprise adoption actually requires.
Whether that bet pays off depends on execution. The features are real. The benchmarks — 10.1% improvement on PowerPoint generation quality, 8.4% on Word docs — are modest but measurable. The infrastructure is in place. What happens next is whether the harness proves out in the workflows that matter.
If you’re building on Claude and you haven’t looked at managed agents since the April launch, the platform is meaningfully different now. The Claude Code memory architecture that builders have been assembling manually has a managed counterpart. The multi-agent patterns that required careful orchestration setup are now part of the default stack.
That’s not nothing. It’s actually the whole point.