Claude Outcomes Feature: How a Grading Agent Improved PowerPoint Quality by 10% Without Changing the Model
Anthropic's Outcomes adds a rubric-based grading agent that re-runs tasks if quality falls short — 10.1% better decks, no model swap.
Anthropic’s Outcomes feature improved PowerPoint generation quality by 10.1% on their internal benchmarks. Not by swapping in a better model. Not by rewriting prompts. By adding a second agent whose only job is to read the output and decide whether it’s good enough.
That’s the thing worth sitting with. The model didn’t change. The task didn’t change. A separate grading agent, scoring the output against a rubric you write, was the entire delta.
This post is about how Outcomes works, why the architecture matters, and what it tells you about where agent quality control is heading.
What Anthropic Actually Shipped
Outcomes is part of Anthropic’s managed agents platform, which got a significant update at the Code with Claude developer event. The managed agents platform already provided sandboxing, state management, and error recovery. This update added three things: a memory system called Dreaming, multi-agent orchestration, and Outcomes.
Outcomes works like this. You write a rubric — a description of what a good output looks like for your specific task. When an agent finishes a task, a separate grading agent reads the output and scores it against that rubric. If the output doesn’t meet the quality threshold, the grading agent flags the specific problems and kicks the task back for another run. Webhooks notify you when the task finally completes.
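To make that loop concrete, here's a rough sketch of the control flow. The helper names, threshold, and re-run limit are placeholders rather than Anthropic's actual SDK; the point is the shape of it: the task agent runs, a separate grader scores the output against the rubric, and anything below threshold gets kicked back with specific flags.

```python
# Illustrative sketch of the Outcomes control flow. These helpers are
# stand-ins for the platform's task agent, grading agent, and webhook
# delivery; names and threshold are placeholders, not Anthropic's SDK.

QUALITY_THRESHOLD = 0.8
MAX_RERUNS = 2

def run_task_agent(task: str, feedback: list[str] | None = None) -> str:
    """Stand-in for the task agent producing a deck, doc, or report."""
    return "<generated output>"

def run_grading_agent(output: str, rubric: str) -> dict:
    """Stand-in for the grading agent. It sees only the output and the
    rubric, never the task agent's reasoning chain."""
    return {"score": 0.9, "flags": []}

def notify_webhook(payload: dict) -> None:
    """Stand-in for the completion webhook."""
    print("webhook:", payload)

def execute_with_quality_gate(task: str, rubric: str) -> str:
    feedback = None
    for attempt in range(1, MAX_RERUNS + 2):
        output = run_task_agent(task, feedback)
        grade = run_grading_agent(output, rubric)
        if grade["score"] >= QUALITY_THRESHOLD:
            notify_webhook({"status": "completed", "attempts": attempt})
            return output
        feedback = grade["flags"]  # specific problems drive the next run
    notify_webhook({"status": "below_threshold", "attempts": attempt})
    return output
```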
The benchmark numbers Anthropic published: 8.4% improvement in Word document quality, 10.1% improvement in PowerPoint quality. Both measured on Anthropic’s internal benchmarks.
The separation between the task agent and the grading agent is deliberate. The grading agent doesn’t have access to the task agent’s reasoning chain — it only sees the output. This matters because an agent that just reviewed its own work would be influenced by the same assumptions that produced the work in the first place. A fresh agent, looking only at the result against the rubric, catches different things.
Why This Architecture Is Interesting
The pattern of using a separate evaluator agent isn’t new. Multi-agent systems have used external graders for a while, mostly on coding tasks. A PR either passes tests or it doesn’t. The rubric is the test suite. The grading agent is the CI runner.
What’s different with Outcomes is the application to non-code knowledge work — documents, presentations, reports. These outputs don’t have a test suite. They have editorial standards, formatting conventions, audience expectations. The rubric has to carry that weight.
That’s a harder problem. “Does this PowerPoint have a clear narrative arc?” is not the same kind of question as “does this function return the right value?” You’re asking a language model to evaluate something subjective, against criteria that a human wrote in natural language.
The 10.1% improvement number suggests it’s working, at least by Anthropic’s internal measure. The caveat is that we don’t know exactly what those benchmarks are measuring or how they were constructed. But even a directional improvement of that size, from a purely architectural change, is worth paying attention to.
The Every/Spiral writing agent is a real-world example. Spiral is a tool built to make AI writing not suck — which is a harder problem than it sounds if you’ve tried using AI for anything beyond generic business prose. Every defined their own editorial rubric based on their writing standards and writer voice, and they’re using Outcomes to enforce that rubric on every agentic draft. The rubric is the editorial standard. The grading agent is the copy editor.
The Part That’s Easy to Miss
Here’s the non-obvious thing about Outcomes: the rubric is the hard part.
The grading agent is infrastructure. Anthropic built it so you don’t have to wire one up yourself. But the rubric is yours. And writing a rubric that actually captures what “good” means for your specific output is genuinely difficult.
Think about what a useful rubric for a financial pitch deck would need to cover. Structure: does it follow a logical flow from problem to solution to market to ask? Content: are the numbers sourced and internally consistent? Tone: does it match the audience (LP, strategic partner, acquirer)? Visual logic: does each slide have one clear point? That’s four different dimensions, each of which could be specified at varying levels of detail.
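One way to manage those dimensions is to draft the rubric as structured criteria first and flatten it into prose afterward. A minimal sketch, with an illustrative schema rather than any Outcomes-prescribed format:

```python
# The four pitch-deck dimensions above, drafted as structured criteria
# before being flattened into prose for the grading agent. Schema is
# illustrative, not an Outcomes format.
PITCH_DECK_RUBRIC = {
    "structure": [
        "The deck follows a problem -> solution -> market -> ask flow.",
    ],
    "content": [
        "Every number is sourced and internally consistent across slides.",
    ],
    "tone": [
        "Language and framing match the stated audience (LP, strategic partner, or acquirer).",
    ],
    "visual_logic": [
        "Each slide makes exactly one clear point.",
    ],
}
```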
A vague rubric produces vague grading. “The output should be high quality” is not a rubric. “Each slide should have a single headline claim supported by no more than three data points, and the narrative should follow a problem-solution-evidence-ask structure” is closer to a rubric.
This is actually where a lot of the work is going to happen as teams adopt Outcomes. Not in configuring the grading agent — Anthropic handles that — but in articulating what good looks like precisely enough that a language model can evaluate against it reliably.
If you’re building agents that produce documents, presentations, or reports, the rubric is the thing you should be spending time on. The infrastructure is already there.
How This Fits Into the Broader Agent Quality Problem
One of the real challenges with agents right now is that they can produce a lot of output very fast, which means human review becomes a bottleneck. If your agent generates a 40-slide deck in three minutes, you can’t review every slide every time. You need some automated quality gate.
The standard approaches before something like Outcomes were: run the task once and hope, or build your own evaluation loop. Building your own evaluation loop means wiring up a second model call, writing the evaluation prompt, handling the conditional logic for re-runs, and managing the state across attempts. That’s not impossible, but it’s enough friction that most teams skip it.
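For a sense of what that friction looks like, here's roughly what a homegrown loop involves, sketched with the Anthropic Python SDK. The prompts, model ID, and JSON shape are illustrative, not a prescribed format.

```python
# The DIY version: a second model call, an evaluation prompt, JSON parsing,
# conditional re-runs, and state carried across attempts.
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # substitute whichever model you run

def generate(task: str, feedback: str | None = None) -> str:
    prompt = task if feedback is None else f"{task}\n\nFix these issues:\n{feedback}"
    msg = client.messages.create(
        model=MODEL, max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def grade(output: str, rubric: str) -> dict:
    eval_prompt = (
        "Score the OUTPUT against the RUBRIC. Reply with bare JSON only: "
        '{"pass": true|false, "problems": ["..."]}\n\n'
        f"RUBRIC:\n{rubric}\n\nOUTPUT:\n{output}"
    )
    msg = client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": eval_prompt}],
    )
    return json.loads(msg.content[0].text)  # assumes the model returns bare JSON

def run_with_eval_loop(task: str, rubric: str, max_attempts: int = 3) -> str:
    feedback = None
    for _ in range(max_attempts):
        output = generate(task, feedback)
        verdict = grade(output, rubric)
        if verdict["pass"]:
            return output
        feedback = "\n".join(verdict["problems"])  # carry flags into the re-run
    return output  # best effort after max_attempts
```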
Outcomes makes the grading agent part of the default setup. You write the rubric; Anthropic handles the rest. This is the same pattern as managed agents generally — taking things that were possible but required custom engineering and making them available as platform primitives.
The multi-agent orchestration piece connects here too. Managed agents now let a lead agent break a job into pieces and delegate to specialist sub-agents, each with their own model, prompts, and tools. Those sub-agents work in parallel on a shared file system, and the lead agent can check in mid-workflow. The whole thing is auditable in Claude Console — you can see what each sub-agent did and in what order, and read the reasoning behind each step.
Outcomes sits on top of this: once the multi-agent system produces output, the grading agent evaluates the final result against the rubric before delivery. The quality gate is at the end of the pipeline, not inside it.
For teams building multi-agent workflows with Claude Code, this is a meaningful addition. You can now define quality criteria once and have them enforced automatically across every run, without adding manual review steps.
The Open-Source Parallel
One thing worth acknowledging: this architecture has been running in open-source agent frameworks for a while.
Projects like Hermes and OpenBrain shipped working versions of eval substrates — essentially Outcomes-like grading loops — before Anthropic’s managed agents research preview. The open-source ecosystem has been ahead on agent primitives for roughly a year. Anthropic has the model capability advantage; the open-source side had the orchestration patterns first.
What Anthropic is doing with Outcomes (and with Dreaming, the memory management feature) is making these patterns accessible without requiring teams to build and maintain the infrastructure themselves. That’s a real value proposition, especially for teams that want to focus on the rubric and the task rather than the evaluation plumbing.
The tradeoff is configurability. A custom-built grading agent can be tuned in ways that a platform primitive can’t. For most teams, the platform version is probably the right starting point. For teams with very specific quality requirements — like Every’s editorial standards — the rubric itself becomes the customization surface.
What a Rubric Actually Looks Like in Practice
Since the rubric is where the real work is, here’s a concrete example of what one might look like for a financial model output.
A weak rubric: “The financial model should be accurate and well-formatted.”
A stronger rubric:
- Revenue projections must include a base case, upside case, and downside case
- Each assumption must be labeled with its source (management guidance, industry benchmark, or analyst estimate)
- The model must balance (assets = liabilities + equity) with no circular references
- Output format must match the company’s standard template with the correct fiscal year headers
- The executive summary tab must contain no more than five key metrics, each with a prior-period comparison
The stronger rubric gives the grading agent something to actually evaluate against. Each criterion is specific enough to be checkable. The grading agent can return a structured assessment: criterion 1 passed, criterion 3 failed (circular reference in cell D47), criterion 5 failed (six metrics present, not five).
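An assessment along these lines, in whatever structure the grader returns (the shape below is illustrative; this post doesn't document Outcomes' actual response format), gives the task agent something concrete to act on during the re-run:

```python
# Illustrative structured assessment for the financial-model rubric above.
assessment = {
    "passed": False,
    "criteria": [
        {"id": 1, "name": "base/upside/downside cases", "passed": True},
        {"id": 2, "name": "assumptions sourced", "passed": True},
        {"id": 3, "name": "model balances, no circular references", "passed": False,
         "detail": "circular reference in cell D47"},
        {"id": 4, "name": "standard template, correct FY headers", "passed": True},
        {"id": 5, "name": "exec summary has at most five metrics", "passed": False,
         "detail": "six metrics present"},
    ],
    "rerun": True,
}
```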
This is also why the Claude Finance suite is interesting. Anthropic released 10 pre-defined agents — pitch builder, meeting preparer, market researcher, evaluation reviewer, month-end closer, and others — along with a full cookbook for customization. The cookbook presumably includes example rubrics for each agent type. That’s a useful starting point for teams building in financial services, because writing the first rubric from scratch is the hardest part.
The evaluation reviewer agent in particular is worth looking at. An agent whose job is to review evaluations — presumably investment memos, credit assessments, or similar — is a natural fit for Outcomes. The rubric for an evaluation reviewer is probably more developed than the rubric for a pitch builder, because evaluation frameworks in finance tend to be more standardized.
Connecting This to How Agents Get Built
The broader pattern here is that agent quality is increasingly a systems problem, not a prompting problem. You can spend a lot of time tuning the task agent’s prompt and get diminishing returns. Adding a grading agent with a well-specified rubric can produce a larger improvement with less effort — which is exactly what the 10.1% number suggests.
This connects to how teams are thinking about building multi-agent systems with persistent memory and self-improvement. The Dreaming feature (scheduled memory review between sessions) and Outcomes (quality gating at task completion) are two sides of the same problem: how do you make an agent system that gets better over time without requiring constant human intervention?
Dreaming handles the learning side — extracting patterns from past sessions and encoding them in orchestration memory. Outcomes handles the quality side — ensuring each output meets a defined standard before delivery. Together they sketch an architecture where agents improve across sessions (Dreaming) and maintain quality within sessions (Outcomes).
For teams building production agents, platforms like MindStudio offer a way to compose these kinds of multi-agent workflows visually — 200+ models, 1,000+ integrations, and a builder for chaining agents without writing the orchestration code from scratch. The rubric-and-grader pattern is something you can implement at the workflow level regardless of which infrastructure you’re running on.
The question of how to specify quality precisely enough for automated evaluation is also showing up in how people think about building applications from AI-generated output. Tools like Remy take a related approach at a different layer: you write a spec — annotated markdown where prose carries intent and annotations carry precision — and the full-stack application gets compiled from it. The spec is the source of truth; the generated code is derived output. The rubric in Outcomes plays a similar role: it’s the authoritative statement of what “correct” looks like, and everything else is evaluated against it.
What to Watch and What to Do
If you’re building agents that produce documents, presentations, or structured reports, Outcomes is worth testing now. The benchmark improvement is meaningful, even granting that it was measured on Anthropic’s own benchmarks, and the architecture is sound.
The practical steps:
Start with one output type. Pick the agent output where quality variance is most costly — the thing that, when it’s wrong, takes the most time to fix manually. Write a rubric for that output type first.
Be specific in the rubric. Each criterion should be checkable by a language model that has never seen your organization’s standards before. If a criterion requires context that isn’t in the rubric, add that context.
Look at the grading agent’s output, not just the final result. When the grading agent flags a problem and kicks the task back, read the flag. That’s signal about where your task agent is consistently falling short — and potentially where your rubric needs to be more precise.
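One low-effort way to do that is to log the failed criteria from every graded run and tally them; the criteria that fail most often point to either a consistent weak spot in the task agent or an under-specified rubric. A minimal sketch:

```python
# Tally the grader's flags across runs. Assumes each graded run's failed
# criteria were logged as a list of names.
from collections import Counter

def failure_hotspots(run_logs: list[list[str]]) -> list[tuple[str, int]]:
    """Return failed criteria sorted by how often they appear."""
    return Counter(flag for flags in run_logs for flag in flags).most_common()

logs = [
    ["narrative structure", "unsourced figures"],
    ["unsourced figures"],
    ["unsourced figures", "slide has multiple claims"],
]
print(failure_hotspots(logs))
# [('unsourced figures', 3), ('narrative structure', 1), ('slide has multiple claims', 1)]
```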
The Claude Finance cookbook is a concrete resource if you’re building in financial services. The 10 pre-defined agents are deployable as plugins for co-work or Claude Code, or as managed agents, and the cookbook gives you the internals so you can modify them. Even if you’re not in finance, the rubric examples are worth reading as a model for how to specify quality criteria.
For teams already using parallel agent workflows with Claude Code, adding an Outcomes-style quality gate at the end of the pipeline is a natural next step. The infrastructure is there. The rubric is the work.
The 10.1% improvement came from adding one agent and writing one document. That’s a good ratio.