What Is the AutoResearch Loop? How to Apply Karpathy's Pattern to Business Optimization
AutoResearch lets AI agents autonomously run experiments, measure results, and keep improving overnight. Here's how to apply it beyond machine learning.
The Loop That Optimizes Itself
Most business optimization still works the same way it always has: someone notices a problem, forms a hypothesis, runs a test, reads the results two weeks later, holds a meeting about it, and maybe implements a change. Then the cycle restarts — manually, whenever someone has bandwidth.
What if that cycle ran continuously, without anyone touching it between iterations?
That’s the idea behind the AutoResearch loop, a pattern Andrej Karpathy (former Director of AI at Tesla, co-founder of OpenAI) has described as a core mechanism for AI systems that improve themselves autonomously. Originally developed in the context of machine learning research, the AutoResearch pattern applies directly to business optimization — and with modern multi-agent AI systems, it’s practical without a team of engineers.
This article covers what the AutoResearch loop is, how it works, and how to apply it across real business domains.
What Karpathy’s AutoResearch Pattern Actually Is
Karpathy has described a model of AI-driven research where the human-in-the-loop is replaced by a continuous, self-directed cycle. In machine learning, the loop looks like this:
1. Generate a hypothesis — What change might improve performance?
2. Design an experiment — How do we test it?
3. Run the experiment — Execute, often across many parallel runs
4. Evaluate the result — Did it work? By how much?
5. Feed results back in — Use what was learned to generate better hypotheses
6. Repeat
The critical feature isn’t any individual step. It’s that the output of step 5 becomes the input to step 1 — automatically, without a human deciding whether to continue.
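As a sketch, the cycle is just a loop whose state is the experiment history. Everything below is a toy stand-in (the helper functions and record fields are invented for illustration), but the wiring is the point: the synthesis appended at the end of one pass becomes the generator's input on the next.

```python
# Self-contained sketch of the five-step cycle. All four helpers are toy
# stand-ins; a real loop would back them with agents and live integrations.

def generate(history):
    # Step 1: propose the next experiment, informed by past results.
    return {"id": len(history) + 1, "change": "subject-line variant"}

def execute(hypothesis):
    # Steps 2-3: design and run the experiment. Stand-in result only.
    return {"open_rate": 0.20 + 0.01 * hypothesis["id"]}

def evaluate(raw):
    # Step 4: reduce raw data to a single outcome metric.
    return raw["open_rate"]

def synthesize(hypothesis, outcome):
    # Step 5: write a structured note the generator reads next cycle.
    return {"hypothesis": hypothesis, "outcome": outcome}

def autoresearch_loop(iterations):
    history = []
    for _ in range(iterations):
        h = generate(history)               # step 5's output feeds step 1
        outcome = evaluate(execute(h))
        history.append(synthesize(h, outcome))
    return history
```

Nothing here is intelligent yet; the value arrives when `generate` actually reads the history it is handed, which is what the synthesis layer described below enables.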
In ML, this mirrors how model training already works: each forward pass produces a loss, and that loss drives the next weight update. AutoResearch extends the same logic upward — not just to model parameters, but to the research strategy itself. The system decides what to study next based on what it’s already learned.
Karpathy has written and spoken about the potential of AI systems to run their own research agendas, compress research cycles from months to days, and explore hypotheses at a scale no human team could match. The business implication is straightforward: if you can build a loop that runs experiments, measures outcomes, and informs next steps automatically, you compress your own optimization cycle the same way.
Why This Is Different From Standard Automation
Standard automation executes a fixed process. An automation that sends a welcome email when someone signs up doesn’t learn whether the email worked. It doesn’t change the subject line next time.
The AutoResearch loop is different because it includes a feedback mechanism that shapes future behavior. Each iteration produces information that changes what happens in the next iteration. That’s optimization, not just execution.
This distinction is what makes the pattern worth understanding even if you’re nowhere near machine learning.
The Four Components of Any AutoResearch Loop
Whether you’re optimizing a neural network or a sales email sequence, every AutoResearch loop has four components. Understanding these makes it easier to see how to build one for your own context.
1. A Generator
The generator proposes the next experiment. In ML, it might be a meta-learning system suggesting architectural changes. In a business context, it might be an AI agent that drafts five variations of a landing page headline based on historical conversion data.
The generator needs to learn from history — not just produce random variations, but informed ones that reflect what’s already been tried.
2. An Executor
The executor runs the experiment. In ML, this trains a model. In business, it deploys a campaign, publishes a content variant, runs an outreach sequence, or adjusts a pricing configuration.
For the loop to run autonomously, execution must be automated. If a human has to manually push “go” on each experiment, the loop is only semi-autonomous — which limits how many cycles you can run.
3. An Evaluator
After the experiment runs, something has to measure what happened. In ML, this is a validation metric. In business, it might be click-through rate, reply rate, conversion rate, or revenue per session.
The evaluator needs to be specific and automated. If reading results requires human interpretation, the loop breaks at this step.
4. A Memory and Synthesis Layer
This is the component most early implementations miss. After evaluation, results need to go somewhere that influences the next hypothesis. A plain text log isn’t enough — the system needs to synthesize what it learned and feed that synthesis into the generator’s context.
In a multi-agent system, this is often a dedicated agent that reads experiment history, identifies patterns, and writes structured notes the generator uses in the next cycle.
Why the Pattern Works Outside Machine Learning
The AutoResearch loop was built for ML, but it works anywhere you have:
- A measurable outcome — something you can quantify automatically
- A controllable variable — something you can change systematically
- Sufficient volume — enough data to detect signal within a reasonable timeframe
- A repeatable process — something that can be automated end-to-end
Most business functions qualify. Marketing has conversion rates and audience segments. Sales has reply rates and pipeline metrics. Operations has processing times and error rates. Pricing has revenue per transaction.
The historical bottleneck was execution and evaluation — both required humans to deploy experiments and interpret results. As AI agents gain the ability to connect to live tools and pull data programmatically, that bottleneck is largely gone.
Six Business Domains Where This Pattern Works
Email Marketing and Outreach
An AutoResearch loop for email might:
- Generate: Draft five subject line variants based on historical open rate data, with notes on which angles haven’t been tested recently
- Execute: Deploy variants via A/B test in your email platform
- Evaluate: Pull open rate, click rate, and unsubscribe rate after 48 hours
- Synthesize: Update the hypothesis log with what worked, and generate the next round
Over time, the generator builds a richer model of what resonates with your specific audience — without anyone manually reviewing every campaign and deciding what to test next.
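One way to make the generator "informed" rather than random is a simple explore/exploit rule over message angles. The angle names, history format, and `next_angle` function below are illustrative assumptions, not part of any email platform's API:

```python
import random

# Toy sketch of an informed generator for subject lines: it first covers
# angles that have never been tested, occasionally revisits the least-tested
# angle, and otherwise exploits the angle with the best historical open rate.

def next_angle(history, angles, explore_rate=0.2, rng=random.random):
    untested = [a for a in angles if all(h["angle"] != a for h in history)]
    if untested:
        return untested[0]                  # cover the search space first
    if rng() < explore_rate:
        # Occasionally revisit the least-tested angle to avoid stagnating.
        counts = {a: sum(h["angle"] == a for h in history) for a in angles}
        return min(angles, key=counts.get)
    # Otherwise exploit: best average open rate so far.
    def avg(a):
        scores = [h["open_rate"] for h in history if h["angle"] == a]
        return sum(scores) / len(scores)
    return max(angles, key=avg)
```

In practice the "generate" step would also draft the actual copy; the rule above only decides which direction is worth drafting next.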
Content SEO
For content teams:
- Generate: Propose title variations, meta description tests, or internal linking changes based on current ranking data and competitor gaps
- Execute: Publish or update content
- Evaluate: Pull ranking changes, click-through rate from Search Console, and time-on-page after 2–4 weeks
- Synthesize: Identify which content patterns correlate with ranking improvements; generate next recommendations
This doesn’t replace a content strategist. But it automates the hypothesis-generation and measurement work that currently consumes a lot of their time.
Sales Outreach Sequences
- Generate: Vary framing, length, CTA placement, and send timing across outreach messages
- Execute: Deploy sequences through your CRM or outreach tool
- Evaluate: Track reply rates, meeting bookings, and pipeline created by variant
- Synthesize: Surface which combination of persona, message frame, and timing predicts response; retire what’s not working
The loop learns which signals predict engagement, without a sales manager manually reviewing every campaign performance report.
Pricing and Packaging
Pricing optimization has traditionally required expensive consultants or years of accumulated intuition. An AutoResearch loop can:
- Generate: Propose specific price point or bundle variations within a defined range
- Execute: Apply via feature flags or staging environments
- Evaluate: Track conversion rate, average order value, and retention by cohort
- Synthesize: Identify which configurations maximize revenue without damaging retention
This needs guardrails — you don’t want wildly different prices shown to customers in ways that erode trust. But for testing between defined options, the loop can run safely.
Customer Support Routing
- Generate: Propose new routing rules or resolution template variants
- Execute: Apply rules to a controlled segment of incoming tickets
- Evaluate: Measure resolution rate, CSAT, and escalation rate
- Synthesize: Update routing logic based on what resolved tickets fastest and with the best outcomes
Over time, the loop builds a better model of which ticket types are best handled by which path — without requiring manual analysis of support data.
Ad Creative and Targeting
Paid advertising is one of the most natural fits, since ad platforms already have testing built in:
- Generate: Propose new creative concepts, copy angles, or audience segments
- Execute: Launch test campaigns
- Evaluate: Pull CPC, CTR, conversion rate, and ROAS
- Synthesize: Retire losing variants, scale winners, generate next round
Most teams already run ad tests. The AutoResearch loop automates the parts that currently require a human to sit down and decide what to test next.
How to Build an AutoResearch Loop: A Step-by-Step Approach
Step 1: Pick One Narrow Problem
Start with a single, specific optimization problem. “Improve marketing” is too broad. “Improve the subject line open rate for the weekly newsletter” is specific enough to build a loop around.
The narrower the scope, the easier it is to define your evaluator metric and automate execution.
Step 2: Define Your Measurement Before You Build
The evaluator is the load-bearing component. Confirm before building anything else:
- Can you pull this metric automatically via API or integration?
- Is there enough volume to detect meaningful signal within your time window?
- Is the metric tied directly to the outcome you care about, rather than a loose proxy?
If you can’t answer yes to all three, adjust the metric first.
Step 3: Automate the Execution Path
Get the executor working before adding AI to the generator. If you can’t automatically deploy an email campaign or publish a content variant, there’s nothing for an AI to drive.
Work through the execution manually first. Automate it. Then add the AI layer on top.
Step 4: Build a Structured Experiment Log
The synthesis layer depends on clean, structured data about past experiments. A plain log won’t work well. Build a simple structured store — a spreadsheet, Airtable table, or database — that captures:
- Experiment ID and date
- What was varied
- What was held constant
- The metric outcome
- A brief synthesis note
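As one possible shape for such a store, assuming Python and invented field names (a spreadsheet row with the same columns works just as well), a record might look like:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative record schema for the experiment log. Field names are
# suggestions, not a required format.

@dataclass
class ExperimentRecord:
    experiment_id: str
    run_date: date
    varied: str            # what changed, e.g. "subject line angle: urgency"
    held_constant: str     # e.g. "send time, audience segment, body copy"
    metric_name: str       # e.g. "open_rate"
    metric_value: float
    synthesis_note: str = ""   # filled in after evaluation

log: list[ExperimentRecord] = []
log.append(ExperimentRecord(
    experiment_id="exp-001",
    run_date=date(2025, 1, 6),
    varied="subject line angle: curiosity",
    held_constant="send time, segment, body copy",
    metric_name="open_rate",
    metric_value=0.23,
))
```

The separation between "what was varied" and "what was held constant" is what lets the synthesis agent attribute outcomes to changes later.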
Step 5: Add an AI Agent for Generation and Synthesis
Once execution and evaluation are running, add an AI agent that:
- Reads the experiment log
- Identifies patterns in what’s worked and what hasn’t
- Proposes the next hypothesis with a rationale
- After each result comes in, writes a synthesis note to the log
This is where multi-agent AI workflows become useful — you can have a dedicated generator agent, a separate evaluator that interprets results, and an orchestrator that coordinates the cycle.
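To make the first two duties concrete, here is one sketch of how an orchestrator might assemble the generator's context from the log. The prompt wording and record fields are assumptions; the resulting string would be handed to whatever model or agent platform you use:

```python
# Illustrative sketch: turn recent experiment-log entries into the context
# a generator agent reads before proposing the next hypothesis.

def build_generator_prompt(records, max_recent=10):
    recent = records[-max_recent:]
    lines = [
        f"- {r['id']}: varied {r['varied']}; "
        f"{r['metric']}={r['value']:.3f}; note: {r['note']}"
        for r in recent
    ]
    return (
        "You are the generator in an optimization loop.\n"
        "Recent experiments:\n" + "\n".join(lines) + "\n"
        "Propose ONE next experiment with a rationale, avoiding "
        "repeats of what has already been tried."
    )
```

Capping the context at the most recent entries is a deliberate simplification; a fuller implementation would let the synthesis agent decide which history is still relevant.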
Step 6: Set Guardrails
Any autonomous loop needs limits. Define:
- Budget limits: Maximum spend before requiring human review
- Scope limits: What the loop can and cannot change
- Escalation triggers: If a key metric drops below a threshold, pause and alert
- Review intervals: Even a highly autonomous loop should have a human checkpoint weekly
The goal isn’t to remove humans from the process entirely. It’s to remove humans from the repetitive, low-judgment steps.
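These guardrails can be enforced mechanically before each cycle. The thresholds and field names below are illustrative; the idea is that the orchestrator refuses to proceed when any check fails:

```python
# Sketch of a pre-flight guardrail check run before each cycle.
# State keys and limit values are invented for illustration.

def check_guardrails(state, limits):
    """Return a list of violations; an empty list means the loop may proceed."""
    violations = []
    if state["spend"] > limits["max_spend"]:
        violations.append("budget: spend exceeds cap, require human review")
    if state["key_metric"] < limits["metric_floor"]:
        violations.append("escalation: key metric below threshold, pause and alert")
    if state["proposed_change"] not in limits["allowed_changes"]:
        violations.append("scope: proposed change is outside the allowed set")
    return violations
```

Review intervals live outside this function: a scheduled check that a human has signed off within the last week can simply be another entry in the violations list.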
How MindStudio Makes This Practical Without Engineering
Building an AutoResearch loop from scratch requires connecting multiple systems, scheduling background processes, and managing data flow between agents. That’s a lot of infrastructure to build before you can test whether the loop even works.
MindStudio is built for exactly this kind of multi-agent, multi-step workflow — and it removes most of the infrastructure overhead.
Here’s how the four loop components map to what you can build in MindStudio:
Generator: Build an AI agent that reads from a structured experiment log connected via Airtable, Google Sheets, or Notion, then uses a language model to propose the next experiment. With 200+ models available — including Claude, GPT, and Gemini — you can pick the one best suited for analytical reasoning or creative generation depending on your use case.
Executor: Use MindStudio’s 1,000+ pre-built integrations to connect the generator to your execution layer. For email, connect to HubSpot or Mailchimp. For content, connect to your CMS via webhook. For ads, connect to your ad platform. The agent deploys experiments without manual intervention.
Evaluator: Build a scheduled background agent that runs on a set cadence — daily, weekly, or per-campaign — pulls metric data from your analytics source, and writes structured results back to the experiment log automatically.
Synthesis layer: A separate agent reads completed experiment entries, writes a synthesis note (what worked, what didn’t, what to explore next), and that note feeds directly into the generator’s context on the next cycle.
The result is a workflow that runs in the background continuously. You check the experiment log when you want — rather than being the person driving each iteration manually. MindStudio’s background agents run on a schedule, so the loop keeps moving overnight and over weekends.
You can start building for free at mindstudio.ai.
Common Mistakes to Avoid
Optimizing for the Wrong Metric
The evaluator metric needs to track what you actually care about. Open rate is easy to measure; revenue per subscriber is harder. Optimizing purely for opens can produce clickbait subject lines that damage long-term list health.
Define the right metric before building anything — and revisit it regularly.
Running the Loop With Insufficient Volume
If you’re only sending 200 emails a week, individual experiments won’t produce statistically reliable signal. Small loops need longer evaluation windows — or you’ll act on noise rather than real patterns.
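You can estimate the required volume up front with the standard two-proportion sample-size formula (shown here at 5% two-sided significance and 80% power; treat it as a planning heuristic, not a substitute for proper analysis):

```python
import math

# Back-of-envelope sample size per variant for detecting a difference
# between two rates, using the standard two-proportion approximation.
# Defaults: z_alpha = 1.96 (5% two-sided), z_beta = 0.84 (80% power).

def sample_size_per_arm(p1, p2, z_alpha=1.96, z_beta=0.84):
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

Detecting a lift from a 20% to a 25% open rate needs roughly 1,090 sends per variant; at 200 emails a week split two ways, that is about 11 weeks of data for a single experiment. Larger lifts need far less volume, which is one reason early loops should test bold variations rather than tiny tweaks.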
Never Reviewing the Synthesis
Even if the loop is running well, review the synthesis notes regularly. The agent’s interpretation of results can drift, or it can get stuck in a local optimum — running small variations on a winner instead of exploring genuinely new directions.
A weekly human review of the experiment log keeps the loop from stagnating.
Over-Engineering the First Version
A loop that generates two email subject line variants, tracks open rates in a spreadsheet, and uses a simple prompt to decide which direction to explore next week is a valid AutoResearch loop. Start there. You can add complexity once you’ve confirmed the basic cycle works.
Frequently Asked Questions
What is the AutoResearch loop?
The AutoResearch loop is an autonomous optimization cycle where an AI agent generates hypotheses, runs experiments, measures results, and uses those results to generate the next round of experiments — without requiring human input between iterations. Andrej Karpathy has described this pattern in the context of AI research systems that can design and execute their own experiments at scale.
Is this the same as A/B testing?
Not exactly. Standard A/B testing is a single experiment: you define two variants, run the test, and pick a winner. The AutoResearch loop is a continuous cycle where each experiment is informed by the results of the previous one. An AI, rather than a human, decides what to test next — and the system compounds learning over many iterations rather than treating each test as independent.
Do I need technical skills to implement this?
Not necessarily. The conceptual pattern — generate, execute, evaluate, synthesize — can be implemented with no-code tools and existing platforms. The hardest part is usually automating the execution step. If your experimentation platform (email tool, ad platform, CMS) exposes an API or integration layer, the rest can be built without writing code — including with platforms like MindStudio designed for building autonomous AI agents.
What business metrics work best with an AutoResearch loop?
The best metrics are:
- Automatically measurable — pullable via API without manual export
- Responsive within a reasonable time window — not metrics that take months to move
- High-volume — enough observations to detect signal reliably
- Tied directly to outcomes — not just activity metrics
Email click-through rates, form conversion rates, reply rates, and ad ROAS are good candidates. Customer lifetime value and brand awareness are harder to use as primary loop metrics because they’re slow to move and difficult to attribute to specific experiments.
How is this related to agentic AI?
The AutoResearch loop is one of the clearest practical examples of agentic AI. An agent in this context is an AI system that takes actions, observes results, and uses those results to decide on next actions — rather than simply responding to a single prompt. Multi-agent implementations assign different components of the loop to specialized agents that coordinate with each other, which typically produces better results than a single agent trying to handle generation, evaluation, and synthesis simultaneously.
Can this work for internal operations, not just customer-facing processes?
Yes. The pattern applies equally well to internal optimization. Examples include:
- Refining the sequence of steps in a fulfillment workflow to reduce processing time
- Testing different templates for handoffs between teams
- Improving routing rules for internal support tickets
- Optimizing the prompts used by internal AI tools
Anywhere you have a process with a measurable outcome and a controllable variable, the loop can run — regardless of whether it’s customer-facing or not.
Key Takeaways
- The AutoResearch loop is an autonomous optimization cycle: generate a hypothesis, run an experiment, measure results, synthesize learnings, repeat — without human involvement between iterations.
- Karpathy’s original framing was for ML research, but the pattern applies to any business process with a measurable outcome and a controllable variable.
- The four essential components are: Generator, Executor, Evaluator, and a Memory/Synthesis layer.
- The loop’s value compounds over time — each iteration is better informed than the last because it incorporates what came before.
- Common failure modes include optimizing for the wrong metric, running loops with insufficient volume, and skipping regular human review of synthesis outputs.
- Start narrow: one specific problem, automated execution, a clean metric, and a weekly human checkpoint. Add complexity once the basic cycle works.
- No-code platforms like MindStudio make multi-agent AutoResearch loops practical for teams without dedicated engineering resources.
A simple loop running continuously will compound through dozens of iterations while a more elaborate manual process completes its first cycle. The practical advantage goes to whoever starts iterating first — not whoever designs the most sophisticated system.