Claude Outcomes Feature Improved PowerPoint Quality by 10.1%: How Rubric-Grading Agents Work
Anthropic’s Outcomes feature uses a separate grading agent to score and re-run tasks. It lifted PowerPoint generation quality by 10.1% on internal benchmarks.
A Separate Agent Scores Your Output and Kicks It Back for a Redo
Anthropic’s Outcomes feature improved PowerPoint generation quality by 10.1% and Word document quality by 8.4% on internal benchmarks. That’s not a model upgrade. No new weights, no architecture change. The improvement came entirely from adding a second agent whose only job is to read the output and grade it against a rubric you wrote.
That’s the core idea behind Outcomes, and it’s worth understanding, precisely because the mechanism is simple enough that you could build something like it yourself — and because the benchmark numbers suggest it actually works.
Here’s how it works, why the separation between the task agent and the grading agent matters, and what this tells you about where multi-agent quality control is heading.
What Outcomes Actually Does
When you use Outcomes, you write a rubric. This is just a description of what “good” looks like for your specific task. For a report generation agent, that might be: “The executive summary must be under 200 words, reference at least two data sources, and avoid passive voice.” For a slide deck, it might be: “Each slide must have a single claim in the title, supporting evidence in the body, and no more than four bullet points.”
Once your task agent finishes its work, that output gets handed to a separate grading agent. The grading agent reads the output, reads your rubric, and scores how well the output matches. If the score is below threshold, the grading agent can flag the specific issues and kick the task back for another run.
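Outcomes manages this loop for you, but it helps to see the shape of a grading call. Here is a minimal sketch using the Anthropic Python SDK; the model id, rubric contents, and JSON reply format are illustrative assumptions, not Outcomes’ actual managed interface:

```python
# Sketch of a rubric-grading call with the Anthropic Python SDK.
# The model id, rubric, and JSON reply format are illustrative
# assumptions, not Outcomes' actual managed interface.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """\
- The executive summary must be under 200 words.
- It must reference at least two data sources.
- No passive voice in the executive summary.
"""

def grade(output_text: str) -> dict:
    """Score output against the rubric with a fresh, independent agent."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=1024,
        system='You are a grader. Score the document against the rubric. '
               'Reply with only JSON: {"score": <0-100>, "issues": ["..."]}.',
        messages=[{
            "role": "user",
            "content": f"RUBRIC:\n{RUBRIC}\n\nDOCUMENT:\n{output_text}",
        }],
    )
    return json.loads(response.content[0].text)  # trusting the JSON-only reply
```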
The webhooks piece matters here: you don’t have to sit and watch. Anthropic added webhook support so you get notified when the task is complete — meaning the whole loop (generate, grade, regenerate if needed, notify) can run unattended.
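Catching that notification takes only a few lines of server code. A minimal Flask sketch, where the endpoint path and payload fields are hypothetical rather than Anthropic’s actual webhook schema:

```python
# Minimal webhook receiver sketch with Flask. The endpoint path and
# payload fields ("status", "score") are hypothetical, not Anthropic's
# actual webhook schema.
from flask import Flask, request

app = Flask(__name__)

@app.route("/outcomes-webhook", methods=["POST"])
def outcomes_webhook():
    event = request.get_json(force=True)
    if event.get("status") == "complete":
        # Fetch or forward the finished document here.
        print(f"Task finished with grade {event.get('score')}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```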
The 8.4% and 10.1% improvements are Anthropic’s internal benchmark numbers. They’re not third-party verified, and “quality” is doing some work in that sentence. But the direction is consistent with what multi-agent system designers have observed informally for a while: having a second pass with a fresh context catches things the first pass misses.
Why the Separation Between Agents Is the Whole Point
You might wonder: why not just have the task agent review its own output? Claude is already capable of self-critique. Why add a second agent?
The answer is context contamination. When an agent generates output, it has a full context window of reasoning, intermediate steps, and decisions that led to that output. When it reviews its own work, it’s not looking at the output fresh — it’s looking at it through the lens of everything it already decided. It tends to rationalize rather than evaluate.
The grading agent starts with a clean slate. It sees only the output and the rubric. It has no attachment to the reasoning that produced the output. That independence is what makes the grade meaningful.
This is the same reason code review works better when someone other than the author does it. The author knows what they meant to write. The reviewer only knows what’s actually there.
Anthropic’s implementation makes this separation structural, not just recommended. The grading agent is a distinct agent with its own model, prompts, and context. It’s not a second prompt appended to the same conversation.
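If you sketch this yourself, the distinction is visible in the code: self-critique appends to the generator’s own transcript, while independent grading opens a fresh one. A side-by-side sketch with the Anthropic Python SDK (model id assumed, placeholder content):

```python
# Self-review vs. independent grading, side by side. Model id is
# assumed; draft and rubric are placeholders.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # assumed model id

draft = "..."   # output produced by the task agent
rubric = "..."  # your grading criteria

# Self-review: the critique request is appended to the same transcript,
# so the reviewer sees all the reasoning that produced the draft.
self_review = client.messages.create(
    model=MODEL,
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Write the executive summary."},
        {"role": "assistant", "content": draft},
        {"role": "user", "content": "Review your summary against this rubric: " + rubric},
    ],
)

# Independent grading: a fresh context containing only the rubric and
# the output, with no attachment to the reasoning behind the draft.
grading = client.messages.create(
    model=MODEL,
    max_tokens=512,
    messages=[{"role": "user", "content": f"RUBRIC:\n{rubric}\n\nOUTPUT:\n{draft}"}],
)
```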
The Non-Obvious Part: This Is the First Time It’s Been Applied to Knowledge Work
Most external grading agents deployed so far have been applied to code. That makes sense: code has objective correctness criteria. A PR either passes the unit tests or it doesn’t. The grading rubric writes itself.
Applying rubric-based grading to non-code knowledge work — Word documents, PowerPoint slides, written reports — is a different problem. “Good writing” is not a binary. “Effective slide structure” is not a unit test. You’re asking the grading agent to evaluate against subjective criteria that you, the user, have to articulate.
That’s actually harder than it sounds. Writing a rubric that’s specific enough to be useful but general enough to apply across varied outputs is a skill. Anthropic is betting that users can develop that skill, and the 10.1% improvement in PowerPoint quality suggests the bet is paying off even with imperfect rubrics.
The Every/Spiral writing agent is the clearest example of this in production. Every built Spiral to keep AI-generated writing from defaulting to generic business prose — a real problem if you’ve tried to use AI for anything with a distinct voice. Their implementation uses a multi-agent system across several Claude models, and they’ve now integrated Outcomes with an editorial rubric based on their own writing standards and voice guidelines. The rubric enforces quality before the draft reaches a human editor. That’s the whole value proposition of their product, so the rubric has to be good.
How This Fits Into the Broader Managed Agents Picture
Outcomes didn’t ship in isolation. It’s part of a larger set of managed agent features Anthropic announced at their Code with Claude developer event.
The initial managed agents launch in April gave you a sandbox, state management, error recovery, and cloud computer access — the infrastructure layer. The new features are the quality and memory layer on top of that.
Dreaming is the memory piece: a scheduled background process that reviews past agent sessions, extracts patterns, and restructures memory so it stays useful over time. The idea is that agents should get better the longer they run, not just execute the same way every time.
Multi-agent orchestration is the coordination piece: a lead agent breaks a job into pieces and delegates to specialist sub-agents, each with their own model, prompts, and tools. The sub-agents work in parallel on a shared file system, and their work feeds back into the lead agent’s context. The whole thing is auditable in the Claude console.
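On a self-hosted stack, the same fan-out shape is a lead function delegating to parallel worker calls. This sketch mirrors the shape of the pattern only; the roles, tasks, and model id are illustrative, and it is not Anthropic’s managed implementation:

```python
# Fan-out/fan-in sketch of lead-agent delegation. Roles, tasks, and
# model id are illustrative; this is not Anthropic's managed platform.
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # assumed model id

def run_subagent(role: str, task: str) -> str:
    """One specialist sub-agent: its own system prompt, its own fresh context."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system=f"You are a specialist: {role}.",
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

subtasks = {
    "market researcher": "Summarize Q3 market trends for the deck.",
    "data analyst": "Pull the three strongest revenue figures.",
}

# Sub-agents run in parallel; their results feed back into the lead agent's context.
with ThreadPoolExecutor() as pool:
    results = dict(zip(subtasks, pool.map(run_subagent, subtasks, subtasks.values())))

lead_context = "\n\n".join(f"[{role}]\n{text}" for role, text in results.items())
```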
Outcomes sits in the quality layer: it’s the mechanism that checks whether the output of any of these agents — lead or sub — actually meets your standard before it reaches you.
If you’re building multi-agent systems, the pattern to internalize is: generate → grade → regenerate if needed → deliver. That loop is now a first-class primitive in the managed agents platform, not something you have to wire together yourself.
For teams building these kinds of systems from scratch, multi-agent workflow patterns in Claude Code covers several approaches to structuring agent coordination — including how to think about task decomposition before you even get to quality control.
What You Need to Write a Good Rubric
The rubric is where most implementations will succeed or fail. Anthropic’s system is only as good as the criteria you give the grading agent.
A few things that make rubrics work better in practice:
Be specific about format, not just quality. “Well-written” is not a rubric criterion. “The introduction must state the main finding in the first sentence” is. The grading agent can check the second one; it can only guess at the first.
Include negative criteria. “No passive voice in the executive summary” is easier to grade than “active voice preferred.” The grading agent can scan for passive constructions. It can’t easily score a preference.
Separate must-haves from nice-to-haves. If some criteria are hard requirements (the output must include a data table) and others are stylistic preferences (the tone should be conversational), make that distinction explicit. Otherwise the grading agent will treat them as equivalent.
Test your rubric on known-good and known-bad examples. Before you run it in production, generate a few outputs manually, grade them yourself, then run the grading agent and see if it agrees. If it doesn’t, the rubric needs work.
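One way to encode those tips is a structured rubric that separates hard requirements from preferences, with an aggregate rule that fails the grade outright on any missed must-have. The field names and scoring weights below are illustrative, not an Outcomes schema:

```python
# A structured rubric separating hard requirements from preferences.
# Field names and weights are illustrative, not an Outcomes schema.
rubric = {
    "must_have": [
        "The executive summary is under 200 words.",
        "The introduction states the main finding in the first sentence.",
        "The output includes a data table.",
        "No passive voice in the executive summary.",
    ],
    "nice_to_have": [
        "The tone is conversational.",
    ],
}

def aggregate(must_results: list[bool], nice_results: list[bool]) -> float:
    """Any failed must-have fails the grade outright; nice-to-haves shade the score."""
    if not all(must_results):
        return 0.0
    return 70.0 + 30.0 * (sum(nice_results) / max(len(nice_results), 1))
```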
The Claude Finance suite that Anthropic released alongside these features is a useful reference point. The 10 predefined agents — pitch builder, meeting preparer, market researcher, evaluation reviewer, month-end closer, and others — come with a cookbook that shows how the agents are structured. That cookbook is worth reading even if you’re not in financial services, because it shows how Anthropic’s own team thinks about task decomposition and quality criteria for knowledge work.
Building Your Own Version of This Pattern
You don’t need managed agents to implement the generate-grade-regenerate loop. The pattern is straightforward enough to build with any orchestration layer.
The basic structure (a runnable sketch follows the list):
1. Task agent generates output
2. Grading agent receives output + rubric, returns a score and a list of issues
3. If score is below threshold, pass the issues back to the task agent with instructions to fix them
4. Repeat until score is above threshold or you hit a maximum iteration count (important: always set a max)
5. Deliver the final output
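Here is that loop in Python, assuming `generate` and `grade` callables like the grading call sketched earlier, with `grade` returning a score and a list of issues. The threshold and iteration cap are illustrative:

```python
# Generate-grade-regenerate loop. Threshold, cap, and the generate()/
# grade() helpers are illustrative; grade() is expected to return
# {"score": int, "issues": [str]}.
THRESHOLD = 80        # minimum acceptable grade
MAX_ITERATIONS = 4    # always set a hard cap

def run_loop(task: str, generate, grade) -> dict:
    """Generate, grade, and regenerate until the rubric is met or tries run out."""
    best = {"score": -1, "output": None, "met_rubric": False}
    feedback = ""
    for _ in range(MAX_ITERATIONS):
        prompt = task if not feedback else f"{task}\n\nFix these specific issues:\n{feedback}"
        output = generate(prompt)
        result = grade(output)
        if result["score"] > best["score"]:
            best = {"score": result["score"], "output": output, "met_rubric": False}
        if result["score"] >= THRESHOLD:
            best["met_rubric"] = True
            return best
        # Pass the grader's specific issues back, not just "try again".
        feedback = "\n".join(f"- {issue}" for issue in result["issues"])
    # Out of iterations: deliver the best attempt, flagged as below rubric.
    return best
```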
The tricky part is step 3. You want to pass the grading agent’s specific feedback, not just “try again.” The more specific the feedback, the more targeted the revision. “The executive summary is 340 words, needs to be under 200” is actionable. “The quality is insufficient” is not.
For teams building this kind of orchestration visually rather than in code, MindStudio lets you chain agents and models — including the task agent and grading agent as separate nodes — with 200+ model options and 1,000+ integrations, which means you can wire the grading loop to whatever delivery mechanism you’re using without custom API work. If you want to go further and compile the entire grading workflow into a deployable app from a spec, Remy is MindStudio’s spec-driven full-stack app compiler: you write a markdown spec with annotations describing your generate-grade-regenerate loop, and it compiles into a complete TypeScript app with backend, database, auth, and deployment already handled.
The iteration count limit deserves emphasis. Without it, a poorly written rubric or an edge-case input can send the system into an infinite loop. Set a hard maximum — three to five iterations is usually enough — and have the system deliver the best output it achieved along with a flag indicating it didn’t fully meet the rubric. That’s more useful than a timeout error.
If you’re running these loops inside Claude Code, the AutoResearch self-improving skills pattern shows a related approach: using binary feedback signals to automatically improve prompt quality over time, which is essentially what a rubric-grading loop does at the session level.
The Open-Source Ecosystem Already Had This
One thing worth saying plainly: the generate-grade-regenerate pattern is not new. Multi-agent system designers have been using external grading agents for coding tasks for over a year. The Hermes agent framework had persistent cross-session memory and self-improvement loops months before Anthropic shipped Dreaming.
Jeten Gar put it directly: “The open-source agent ecosystem is leading on primitives… The closed labs have raw model capability. The open source ecosystem has agent primitives. Those are different layers.”
What Anthropic is doing with Outcomes isn’t inventing the pattern. It’s making the pattern the default. If you’re a non-technical user building a report generation agent, you shouldn’t have to know that external grading agents exist, figure out how to prompt one, and wire it into your workflow. Outcomes handles that behind the scenes.
That’s a real contribution, even if it’s not a research breakthrough. Defaults matter. Most users will never build a custom grading agent, but they’ll use Outcomes because it’s there.
The benchmark numbers — 8.4% for Word docs, 10.1% for PowerPoint — are Anthropic’s way of saying: the default is good enough to be worth turning on. Whether those numbers hold in your specific use case depends entirely on how well your rubric matches what you actually care about.
What to Watch and What to Try
The most interesting near-term question is how rubric quality evolves. Right now, writing a good rubric is a skill that most users don’t have. As more teams use Outcomes, you’ll start to see shared rubrics, rubric templates for common document types, and probably tooling that helps you generate rubrics from examples.
The Every/Spiral implementation is worth watching specifically because their rubric is doing real editorial work — enforcing voice and style, not just format. If that approach generalizes, you’ll see Outcomes used not just for quality control but as a way to encode institutional knowledge about what “good” means for a specific organization.
For now, the practical starting point is simple: pick one document type you generate repeatedly, write a rubric with five to ten specific, checkable criteria, and run it through Outcomes for a week. The 10.1% improvement number is an average across Anthropic’s internal benchmarks. Your number will be different. The only way to find out what it is for your use case is to measure it.
If you want to build the grading loop yourself before committing to managed agents, building a multi-agent team with Paperclip and Claude Code shows how to structure lead and sub-agent relationships in a way that maps directly to the generate-grade-regenerate pattern. And for thinking about how to give your agents persistent memory between sessions — the other half of the quality problem — building an AI second brain with Claude Code and Obsidian covers the session-to-session learning side of the equation.
The rubric is the hard part. The infrastructure is increasingly handled for you.