
Anthropic Managed Agents vs. OpenBrain Open-Source: Did Hermes Ship This First?

OpenBrain shipped Dreaming-like memory and Outcomes-like evals nearly a year before Anthropic. Here's what each approach actually offers.

MindStudio Team

Open Source Beat Anthropic to the Punch — Now What Do You Build On?

Hermes and OpenBrain shipped working production versions of persistent cross-session memory and eval-based quality loops nearly a year before Anthropic’s managed agents research preview. That’s the uncomfortable fact sitting underneath all the Code with Claude announcements. If you’re deciding right now whether to build on Anthropic’s managed agent infrastructure or stay with the open-source stack, you’re not choosing between mature and experimental — you’re choosing between two different maturity profiles, with different tradeoffs on control, cost, and how much plumbing you want to own.

This matters because the decision compounds. Agent infrastructure is not easy to swap out once your workflows depend on it. The memory architecture, the eval substrate, the orchestration model — these become load-bearing walls fast.


The Actual Gap Between What Shipped and When

Start with what Anthropic announced at Code with Claude, because the specifics matter.

Dreaming is a scheduled process that reviews agent sessions and memory stores, extracts patterns, and curates memories between sessions. The framing is that it surfaces things a single agent can’t see on its own — recurring mistakes, converging workflows, preferences shared across a team. Memories persist between sessions and are supposed to automatically improve agent performance the longer the system runs.
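
To make the shape concrete, here is a minimal sketch of what a Dreaming-style curation pass could look like, assuming hypothetical session_store, memory_store, and llm interfaces. None of these names come from Anthropic's API; this is the pattern, not the product.

```python
# Hypothetical sketch of a Dreaming-style curation pass. Everything here
# (session_store, memory_store, llm) is an assumed interface, not Anthropic's API.
import json

CURATION_PROMPT = """Review these agent sessions alongside existing memories. Extract:
1. Recurring mistakes worth avoiding
2. Workflows that are converging across sessions
3. Preferences shared across the team
Return a JSON list of {"memory": str, "evidence": [session_id, ...]} objects."""

def dreaming_pass(session_store, memory_store, llm):
    """Scheduled background job: runs between sessions, not during them."""
    sessions = session_store.since_last_run()   # raw transcripts since the last pass
    existing = memory_store.all()               # current curated memories, for dedup
    response = llm(CURATION_PROMPT, context={
        "sessions": sessions,
        "existing_memories": existing,
    })
    for item in json.loads(response):
        # Persist with provenance so a memory can be traced back to its sessions.
        memory_store.upsert(item["memory"], provenance=item["evidence"])
```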

Outcomes lets you write a rubric for what success looks like. A separate grading agent scores the output against that rubric. The separation is deliberate — the grading agent doesn’t see the task agent’s reasoning, only the output. If quality falls short, it can kick the task back for another run. Anthropic’s internal benchmarks showed 8.4% improvement in Word document quality and 10.1% improvement in PowerPoint quality — without any model change. That last part is the interesting result: the same model, better output, just because you added a grader.
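
The pattern is simple enough to sketch. Here is a minimal version of the external-grader loop as described: rubric in, score out, kick back on failure. The task_agent and grader_agent callables are hypothetical stand-ins, not Anthropic's Outcomes API.

```python
# Sketch of the Outcomes pattern described above; task_agent and grader_agent
# are hypothetical stand-ins, not Anthropic's API.

RUBRIC = """Score 1-10 on each dimension:
- Structure: clear sections, logical flow
- Accuracy: claims match the source material
- Voice: matches the house style guide
Pass if the average score is 8 or higher."""

def run_with_outcomes(task, task_agent, grader_agent, max_retries=3):
    feedback = None
    for _ in range(max_retries):
        output = task_agent(task, feedback=feedback)
        # Deliberate separation: the grader sees only the output,
        # never the task agent's reasoning trace.
        verdict = grader_agent(rubric=RUBRIC, output=output)
        if verdict.passed:
            return output
        feedback = verdict.notes    # kick the task back with the grader's notes
    return output                   # best effort after max_retries
```

The separation is the load-bearing design choice: a grader that never saw the reasoning can't rationalize the output.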

Multi-agent orchestration on the managed agents platform lets a lead agent break a job into pieces and delegate to specialist sub-agents, each with its own model, prompts, and tools. They work in parallel on a shared file system. The whole thing is auditable in Claude Console.
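
In skeleton form, the delegation pattern looks something like the following: the lead decomposes the job, fans subtasks out in parallel, and writes every result to a shared workspace so the run can be audited afterward. The names here are illustrative assumptions, not the managed agents SDK.

```python
# Illustrative orchestration skeleton. The specialist callables stand in for
# real sub-agents, each of which would have its own model, prompt, and tools.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

SHARED = Path("/tmp/job-123")   # stands in for the shared file system

def lead(job: str, plan: Callable[[str], list[tuple[str, Callable[[str], str]]]]):
    """Lead agent: decompose the job, delegate, collect, leave an audit trail."""
    SHARED.mkdir(parents=True, exist_ok=True)
    subtasks = plan(job)                    # [(name, specialist_fn), ...]
    with ThreadPoolExecutor() as pool:      # specialists work in parallel
        futures = {name: pool.submit(fn, job) for name, fn in subtasks}
    results = {}
    for name, future in futures.items():
        results[name] = future.result()
        # Each artifact lands in the shared workspace, so the whole run
        # is inspectable after the fact.
        (SHARED / f"{name}.out").write_text(results[name])
    return results
```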

Now here’s where the timeline gets interesting. As commentator Jeten Gar noted in the discourse around the announcement: “The open-source agent ecosystem is leading on primitives. Nous Research with Hermes for orchestration, OpenBrain for personal memory and eval substrates. These projects shipped working production systems before Anthropic shipped a research preview of similar functionality.” The closed labs have raw model capability. The open-source ecosystem has had agent primitives — and has had them for close to a year.

Hermes specifically: it reviews past conversations, builds skills from experience, has persistent cross-session memory, and gets smarter the longer it runs. That’s Dreaming, functionally, shipped earlier and available to anyone willing to run it. If you want to understand the broader landscape of frameworks in this space, the comparison of GStack, Superpowers, and Hermes is worth reading before you commit to any of them.


Five Dimensions That Actually Separate These Approaches

Before the side-by-side, here are the criteria worth caring about.

Infrastructure ownership. Do you want Anthropic running your agent’s state, memory, and error recovery, or do you want that on your own infrastructure? This is partly a cost question and partly a data question. Financial services firms, for instance, may have opinions about where agent memory lives.

Configuration depth. Managed agents give you a well-defined surface area. Open-source gives you the full stack. The question is whether the configuration you need falls inside or outside that surface area.

Reliability and error recovery. Managed agents launched in April with sandbox, state management, and error recovery built in. Open-source implementations vary wildly on this. Hermes is production-grade for many teams, but you’re responsible for the failure modes.

Eval substrate quality. The Outcomes feature is interesting precisely because it makes external grading a default behavior rather than something you have to architect. The 10.1% PowerPoint improvement is a real number from a real benchmark. Open-source eval substrates exist, but they’re less standardized and require more setup to get to the same place.

Ecosystem and connectors. Anthropic shipped new connectors at Code with Claude: Dun & Bradstreet for business identity, Fiscal AI for market analysis, Verisk for insurance underwriting. The Claude Finance suite includes 10 pre-defined agents — pitch builder, meeting preparer, market researcher, evaluation reviewer, month-end closer, and others — deployable as plugins for co-work, Claude Code, or as managed agents. That’s a lot of domain-specific surface area that the open-source ecosystem doesn’t have pre-built.


Anthropic Managed Agents: What You Actually Get

The managed agents platform is best understood as infrastructure-as-a-service for the hard parts of running agents in production. You don’t have to build state management. You don’t have to build error recovery. You don’t have to architect a grading loop — Outcomes handles it. You don’t have to figure out how to spin up a cloud compute instance for your agent; Anthropic provides one.

The Dreaming feature is particularly interesting from a product standpoint because it changes the economics of agent improvement. Previously, if you wanted your agents to get better over time, you had to build that feedback loop yourself — extract patterns from logs, update prompts, manage memory stores. Dreaming makes that a scheduled background process. The Every/Spiral writing agent is a good example of what this looks like in practice: they use a multi-agent system with the Outcomes feature and an editorial rubric to enforce writing quality. The rubric is based on editorial standards and writer voice. That’s a subjective quality standard being enforced by an automated grader — which is genuinely new territory compared to the coding-task eval loops that have existed for a while.

The multi-agent orchestration piece is also more mature than it sounds. A lead agent that can check in on sub-agents mid-workflow, with the whole thing auditable in Claude Console, is meaningfully different from just firing off parallel API calls and hoping they converge. The shared file system and the auditable reasoning trail are what make it production-usable rather than a demo.

The Claude Finance cookbook is worth calling out specifically. Releasing a full cookbook alongside the agents means you can see exactly how they’re built and modify them. That’s a different posture than “here’s a black box of financial agents.” It’s closer to the open-source ethos of giving you the source of truth.

The tradeoffs are real, though. You’re on Anthropic’s infrastructure, which means you’re subject to their rate limits (which have been a genuine problem — the SpaceX compute deal exists because Anthropic was capacity-constrained for most of this year), their pricing, and their terms of service. The managed agents platform is also still relatively new. The April launch was the initial release; the Dreaming and Outcomes features are recent additions. You’re adopting something that’s still being built.


Hermes and OpenBrain: What You Actually Get

The open-source case is strongest for teams that need control, have the engineering capacity to run their own infrastructure, and are building workflows that don’t fit neatly into Anthropic’s surface area.

Hermes gives you persistent cross-session memory, skill-building from experience, and orchestration primitives that have been in production use for close to a year. The memory architecture is more configurable than Dreaming — you can decide what gets stored, how it’s structured, and where it lives. If you want to understand how persistent memory works at the architecture level, the Claude Code source leak’s three-layer memory architecture is a useful reference for how these systems are designed.
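
As a rough sketch of what that configurability means in practice (what gets stored, how it's structured, where it lives), a config surface might look like the following. Every name here is an assumption for illustration, not Hermes's actual API.

```python
# Illustrative memory-store configuration; the fields are assumptions
# meant to show the decision points, not Hermes's real config schema.
from dataclasses import dataclass, field

@dataclass
class MemoryConfig:
    backend: str = "postgres://localhost/agent_memory"  # where memory lives: yours
    store_raw_transcripts: bool = False                 # what gets stored...
    store_extracted_skills: bool = True
    store_user_preferences: bool = True
    schema: dict = field(default_factory=lambda: {      # ...and how it's structured
        "skill": ["name", "steps", "source_sessions"],
        "preference": ["key", "value", "scope"],        # scope: user vs. team
    })
    retention_days: int = 90                            # the curation policy is yours

config = MemoryConfig(store_raw_transcripts=True, retention_days=365)
```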

OpenBrain is specifically interesting for the memory and eval substrate side. It’s a personal Supabase database connected to AI via MCP, which means your agent memory is in a database you own and control. The full breakdown of what OpenBrain is covers the architecture in detail, but the short version is: if data residency matters to you, this is the path. You’re not handing your agent’s memory to a third party.
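
In practical terms, agent memories become rows in a Postgres table you administer, and the MCP server is just the bridge. A sketch using the supabase Python client, where the memories table and its columns are assumptions for illustration:

```python
# Sketch of memory reads/writes against a self-owned Supabase project.
# The "memories" table and its columns are assumed, not OpenBrain's schema.
import os
from supabase import create_client

# Credentials point at *your* project, not a vendor's memory store.
client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def remember(content: str, kind: str = "note"):
    client.table("memories").insert({"content": content, "kind": kind}).execute()

def recall(kind: str, limit: int = 20) -> list[str]:
    rows = (client.table("memories")
                  .select("content")
                  .eq("kind", kind)
                  .order("created_at", desc=True)
                  .limit(limit)
                  .execute())
    return [row["content"] for row in rows.data]
```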

The eval substrate in the open-source ecosystem is less standardized than Outcomes, but it’s also more flexible. If your quality rubric is complex — multiple dimensions, weighted criteria, domain-specific standards — you have more room to build exactly what you need. The tradeoff is that you’re building it. The 10.1% PowerPoint improvement Anthropic cited came from a specific implementation of a grading agent. You can build the same thing; it just takes more work.
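
To make that concrete: a weighted, multi-dimension rubric reduces to a few lines once you own the grader. The dimensions, weights, and judge interface below are invented for illustration.

```python
# Weighted multi-dimension rubric: the kind of eval a single rubric string
# doesn't directly express. Dimensions and weights are illustrative.

RUBRIC = {
    # dimension: (weight, grading instruction for the judge model)
    "accuracy":  (0.5, "Are all claims supported by the source material?"),
    "structure": (0.3, "Clear sections, logical flow, no orphaned points?"),
    "voice":     (0.2, "Does it match the house style guide?"),
}

def weighted_grade(output: str, judge, pass_threshold: float = 8.0) -> bool:
    """judge(output, instruction) -> float in [0, 10]; a hypothetical LLM call."""
    total = sum(weight * judge(output, instruction)
                for weight, instruction in RUBRIC.values())
    return total >= pass_threshold
```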

The honest limitation of the open-source path is operational overhead. State management, error recovery, infrastructure reliability — these are problems you own. For a small team building internal tooling, that’s often fine. For a team trying to ship a product to external users with SLA expectations, it’s a real cost.

MindStudio sits in an interesting middle position here: it’s a no-code platform with 200+ models and 1,000+ integrations, which means you can build multi-agent orchestration without writing the plumbing, but you’re not locked into a single model provider the way you are on the managed agents platform. That matters if you want to mix Claude with other models for cost optimization, which is exactly what Every/Spiral does.


Which Path Fits Which Situation

Use Anthropic managed agents if:

You’re building in financial services, insurance, or another domain where the new connectors (Dun & Bradstreet, Fiscal AI, Verisk) are directly relevant. The Claude Finance suite with its 10 pre-built agents gives you a starting point that would take weeks to replicate from scratch.

You want the Outcomes feature without building it. The grading agent pattern is well-understood, but having it as a default behavior rather than a custom build is a real time savings. If you’re building a report generation agent for non-technical users, Outcomes is the right call.

Your team doesn’t have the engineering capacity to run agent infrastructure. Managed agents handles sandbox, state management, and error recovery. That’s not trivial to build well.

You’re building on top of Claude Code specifically. Boris Churnney’s claim that Anthropic itself has zero manually written code — all produced by Claude agents coordinating over Slack — is a signal about where the platform is headed. The five Claude Code workflow patterns post covers the practical patterns that work in production.

Use Hermes/OpenBrain if:

Data residency is a constraint. If your agent memory can’t live on Anthropic’s infrastructure, OpenBrain’s Supabase-based approach is the answer.

You need configuration depth that exceeds what managed agents exposes. Complex memory structures, custom eval substrates, non-standard orchestration patterns — the open-source path gives you the full stack.

You’re already running Hermes in production and it’s working. The switching cost of moving to managed agents is real, and the open-source ecosystem has a year’s head start on some of these primitives. Don’t fix what isn’t broken.

You want model flexibility. Hermes isn’t tied to Claude. If your workflow benefits from mixing models — using a cheaper model for some sub-agents, a more capable one for others — the open-source path doesn’t constrain you.


The Deeper Question This Comparison Raises

The fact that open-source shipped Dreaming-like memory and Outcomes-like evals first isn’t a knock on Anthropic. It’s a structural feature of how innovation works in this space. Open-source projects can experiment faster because they’re not responsible for reliability at scale. Anthropic’s contribution is taking those primitives and making them production-grade with managed infrastructure, auditable reasoning, and a connector ecosystem.

The question for builders is which layer of that stack you want to own. If you’re building something where the agent infrastructure is a means to an end — you want the output, not the plumbing — managed agents is probably the right call. If the agent infrastructure is itself a differentiator, or if you need control that managed agents doesn’t offer, the open-source path is worth the operational overhead.

On the code generation side, there’s a parallel abstraction happening. Tools like Remy take a different approach to the “what’s the source of truth” question: you write a spec in annotated markdown, and the full-stack application — TypeScript backend, SQLite database, auth, deployment — gets compiled from it. The spec is the source of truth; the code is derived output. That’s a different layer of abstraction than agent orchestration, but it’s the same underlying question: how much of the stack do you want to own versus have generated for you?

Anthropic’s roadmap hints — higher judgment and code taste, context windows that “feel infinite,” improved multi-agent coordination — suggest that the managed agents platform is going to keep closing the gap with what the open-source ecosystem has built. The question is whether it closes that gap fast enough for your timeline, and whether the control tradeoffs are acceptable for your use case.

The open-source ecosystem led on primitives. The closed platform is catching up with production-grade infrastructure. Both of those things are true simultaneously, and the right answer depends on what you’re building and what you can afford to own.

Presented by MindStudio
