# How to Use GitHub Actions to Run AutoResearch Experiments on a Schedule
Deploy an AutoResearch loop to GitHub Actions to run A/B experiments on cold email, landing pages, or AI skills automatically every hour without a server.
## Why GitHub Actions Is a Surprisingly Good Research Platform
Most teams that want to run automated experiments reach for a server, a cron job on a VM, or a paid scheduling tool. All of those work, but they add cost and maintenance overhead that’s hard to justify for an experiment loop that runs a few API calls every hour.
GitHub Actions changes that calculation. You get free compute on a schedule, version-controlled configuration, secrets management, and logging — all without provisioning anything. For running AutoResearch experiments, it’s a strong fit.
This guide walks through setting up a scheduled AutoResearch loop on GitHub Actions to run A/B experiments automatically — on cold email, landing page copy, or AI skill prompts — without maintaining a server.
## What an AutoResearch Experiment Loop Actually Is
An AutoResearch loop is a self-contained cycle where AI both generates test variants and evaluates results — no human checkpoint required between iterations. The basic cycle looks like this:
- Generate variants — An LLM proposes N alternatives for the thing being tested (a subject line, a headline, a prompt)
- Deploy or record the variant — The variant is sent, published, or logged for comparison
- Collect performance data — Metrics are pulled from an API (open rates, conversion rates, output quality scores)
- Analyze results — The loop scores each variant and identifies a winner
- Update the baseline — The winning variant becomes the new control
- Repeat on schedule
Traditional A/B testing requires a human to review results and decide what to test next. AutoResearch replaces that step with an AI judgment call based on the data collected since the last cycle.
This works well for three categories in particular:
- Cold email — Subject lines, opening sentences, and CTAs tested with real sends
- Landing pages — Headlines and value propositions rotated and measured against conversion data
- AI skill optimization — System prompts or few-shot examples benchmarked using an LLM-as-judge pattern
The AI that generates variants can also evaluate results and decide what direction to test next. Deployed on GitHub Actions with a schedule trigger, the whole thing runs continuously without you touching it.
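In code, the cycle above reduces to a small orchestration function. This is a minimal sketch, not a real library API: `generate`, `deploy`, and `collect` are stand-ins for whatever your platform integration provides.

```python
def run_cycle(state: dict, generate, deploy, collect) -> dict:
    """One AutoResearch iteration. The callables are platform-specific:
    generate(baseline) -> list of variant strings,
    deploy(variant) publishes or sends a variant,
    collect(variant) -> float score from the last cycle's metrics."""
    # 1. Score the variants deployed in the previous cycle
    scores = {v: collect(v) for v in state["active_variants"]}
    # 2. Promote a winner if it beats the current baseline
    if scores:
        best = max(scores, key=scores.get)
        if scores[best] > state["baseline_score"]:
            state["baseline"], state["baseline_score"] = best, scores[best]
    # 3. Generate and deploy the next batch against the (possibly new) baseline
    state["active_variants"] = generate(state["baseline"])
    for variant in state["active_variants"]:
        deploy(variant)
    state["iteration"] += 1
    return state
```

Keeping the stages injectable makes the loop easy to test with stubbed metrics before pointing it at a live email or analytics API.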
## What You’ll Need Before Starting
Before writing any YAML, confirm you have:
- A GitHub account with a repository for the experiment project
- An LLM API key (OpenAI, Anthropic, or your preferred provider) stored as a GitHub secret
- Access to the platform API for whatever you’re testing — your email tool, analytics provider, or a custom endpoint
- A storage strategy for results — a JSON file committed back to the repo is the simplest starting point; Airtable or Google Sheets works if you need dashboards
- Basic familiarity with YAML and either Python or Node.js
You don’t need a server, a Docker registry, or a database. GitHub Actions provides the compute, and the repo itself handles versioned storage.
## Setting Up the Scheduled Workflow

GitHub Actions workflows live in `.github/workflows/` in your repository. Create a file called `autoresearch.yml` there.
Here’s the base structure for a workflow that runs every hour:
```yaml
name: AutoResearch Experiment Loop

on:
  schedule:
    - cron: '0 * * * *'   # Top of every hour, UTC
  workflow_dispatch:       # Manual trigger for testing

# Needed so the results commit can push with the default GITHUB_TOKEN
permissions:
  contents: write

concurrency:
  group: autoresearch
  cancel-in-progress: false

jobs:
  run-experiment:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run experiment loop
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EMAIL_API_KEY: ${{ secrets.EMAIL_API_KEY }}
        run: python experiment.py

      - name: Commit results
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add results/
          git diff --staged --quiet || git commit -m "AutoResearch: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
          git push
```
A few things worth noting:
- `workflow_dispatch` adds a manual trigger button in the GitHub UI — useful for testing before the schedule kicks in
- `timeout-minutes: 30` prevents runaway jobs from consuming your free-tier minutes
- The commit step only fires when there are actual file changes, avoiding empty commits
- The `concurrency` block ensures a new run waits for the current one to finish rather than running in parallel
### Configuring Secrets

Never hardcode API keys in your workflow file. In your repository, go to Settings → Secrets and variables → Actions and add each key your script needs. Reference them in the YAML as `${{ secrets.YOUR_KEY_NAME }}`.
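Inside the script, a small guard that fails fast with a pointer to the settings page saves debugging time when a secret was never configured. `require_env` is an illustrative helper, not part of any SDK:

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, failing loudly if the
    corresponding GitHub secret was never configured."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing env var {name}; add it under "
            "Settings → Secrets and variables → Actions"
        )
    return value
```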
### Choosing Your Schedule
GitHub Actions uses standard cron syntax, always in UTC:
| Frequency | Cron expression |
|---|---|
| Every hour | `0 * * * *` |
| Every 6 hours | `0 */6 * * *` |
| Daily at 9am UTC | `0 9 * * *` |
| Weekdays at 8am UTC | `0 8 * * 1-5` |
Note: GitHub may delay scheduled runs by a few minutes during high-traffic periods. This is fine for experiment loops but rules out anything requiring precise timing.
## Building the Experiment Loop Script

The workflow above calls `experiment.py`. Here’s the structure that script needs.
### Load Current State
```python
import json
import os
from pathlib import Path

STATE_FILE = Path("results/state.json")

def load_state():
    """Return persisted experiment state, or a fresh state on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {
        "baseline": None,
        "baseline_score": 0,
        "active_variants": [],
        "results": [],
        "iteration": 0
    }
```
The state file tracks what’s been tested and what the current baseline is. Committing it to the repo means state persists across runs — no external database required.
### Generate New Variants
```python
import openai

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_variants(baseline: str, recent_results: str, n: int = 3) -> list:
    """Ask the LLM for n new test variants, informed by recent performance."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a conversion optimization expert. Propose test variants."
            },
            {
                "role": "user",
                "content": f"""
Current baseline: {baseline}
Recent performance data: {recent_results}

Generate {n} alternative variants to test. Return as JSON: {{"variants": [...]}}
"""
            }
        ],
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("variants", [])
```
Passing recent results back to the LLM is important — it lets the model make informed proposals rather than guessing. After several iterations, the model builds implicit knowledge of what directions work for your specific context.
### Collect and Score Results
```python
def collect_and_score(state: dict) -> dict:
    """Pull metrics for variants deployed in the previous cycle."""
    scores = {}
    for variant in state.get("active_variants", []):
        if variant["type"] == "email_subject":
            # get_email_metrics is your own wrapper around the email
            # platform's reporting API
            metrics = get_email_metrics(variant["deployment_id"])
            # Weighted score: open rate matters more than click rate for cold email
            scores[variant["id"]] = {
                "text": variant["text"],
                "score": metrics["open_rate"] * 0.6 + metrics["click_rate"] * 0.4,
                "metrics": metrics
            }
    return scores
```
### Update the Baseline
```python
def update_baseline(state: dict, scores: dict) -> tuple[dict, bool]:
    updated = False
    if not scores:
        return state, updated
    best_id = max(scores, key=lambda x: scores[x]["score"])
    best_score = scores[best_id]["score"]
    # Only update if there's a meaningful improvement and sufficient sample size
    if best_score > state["baseline_score"] * 1.10:  # At least 10% better
        state["baseline"] = scores[best_id]["text"]
        state["baseline_score"] = best_score
        updated = True
    state["results"].append({
        "iteration": state["iteration"],
        "scores": scores,
        "baseline_updated": updated
    })
    return state, updated
```
The 10% improvement threshold helps avoid updating the baseline on noise. Combine this with a minimum sample size check (described in the common mistakes section) for more reliable results.
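A minimal sample-size gate to pair with that threshold, assuming each entry’s `metrics` dict carries a `sends` count (the field name is hypothetical; use whatever your platform returns):

```python
MIN_SENDS = 150  # per-variant floor before rate metrics are trusted

def has_sufficient_sample(scores: dict, min_sends: int = MIN_SENDS) -> bool:
    """True only when every scored variant has enough observations
    for its rate metrics to be meaningful."""
    return bool(scores) and all(
        s["metrics"].get("sends", 0) >= min_sends for s in scores.values()
    )
```

Call this before `update_baseline` and skip the update entirely when it returns False.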
## Three Practical Use Cases
### Cold Email Subject Line Testing

Cold email is one of the strongest use cases for an AutoResearch loop. A consistent 5-point improvement in open rate, compounded over weeks, can meaningfully change your pipeline. The loop works as follows:
- Use your current best-performing subject line as the baseline
- Generate three new variants targeting different angles — curiosity, specificity, and direct benefit
- Create campaigns in your email platform with each variant via API
- Pull open and reply rates after 24–48 hours
- Update the baseline if a variant beats the threshold
For email, run the variant deployment once per day and the analysis step every hour. This gives you enough data before generating new variants.
### Landing Page Copy Optimization
For landing pages, the loop writes variant headlines or value propositions to a config file that your site reads. Combined with a conversion analytics API (GA4, Plausible, or PostHog), you can pull conversion rates per variant and rotate toward the best performer automatically.
A clean architecture: keep your variant config as a JSON file in the same repo. Your site reads from a CDN-hosted copy of that file. The loop updates the file, commits it, and your CDN serves the new variant — no CMS required.
### AI Skill Prompt Optimization
This use case is valuable if you’re building AI agents. The loop tests different system prompts, few-shot examples, or output formats using an LLM-as-judge approach:
- Generate N variants of a system prompt
- Run each variant against a fixed test set of 10–20 representative inputs
- Use a separate LLM call to score each output against defined quality criteria
- Commit the highest-scoring prompt as the new baseline
The result is a self-improving agent whose prompt gets sharper with every cycle. Paired with MindStudio’s autonomous background agents, this kind of iterative prompt tuning can run indefinitely without manual review.
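A sketch of that judge, with the model call injected as a callable so the scoring logic stays testable. `ask(prompt) -> str` is a stand-in for your LLM client; the rubric wording is illustrative:

```python
def judge_output(variant_output: str, criteria: str, ask) -> float:
    """Score one output 0-10 against quality criteria using a separate
    LLM call. `ask(prompt) -> str` wraps whatever client you use."""
    reply = ask(
        "Score this output from 0 to 10 against the criteria.\n"
        f"Criteria: {criteria}\nOutput: {variant_output}\n"
        "Reply with the number only."
    )
    try:
        # Clamp to the rubric's range in case the judge drifts
        return max(0.0, min(10.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable judge reply counts as a failure

def score_variant(outputs: list, criteria: str, ask) -> float:
    """Average judge score across the fixed test set's outputs."""
    return sum(judge_output(o, criteria, ask) for o in outputs) / len(outputs)
```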
## Storing and Accessing Results
### Commit to the Repo
Committing results as JSON to your repository is the simplest strategy. It gives you version-controlled history, no cost, and an audit trail in the commit log. For most teams running 1–5 experiment loops, this is enough.
### Write to External Storage
If you need queryable data or dashboards, write results to Airtable, Google Sheets, or Supabase via their APIs. All three have free tiers and are easy to connect from a Python script.
### Notifications
Add a Slack notification step that fires when a new baseline is found:
```yaml
- name: Notify on baseline update
  if: env.BASELINE_UPDATED == 'true'
  uses: 8398a7/action-slack@v3
  with:
    status: custom
    custom_payload: |
      {"text": "New baseline: ${{ env.NEW_BASELINE }}"}
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```
Set the `BASELINE_UPDATED` environment variable from your Python script by appending a `NAME=value` line to the file at `$GITHUB_ENV`, so later workflow steps can read it.
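In Python, that amounts to appending to the file whose path Actions provides in the `GITHUB_ENV` variable; the helper name here is mine, not part of any SDK:

```python
import os

def set_workflow_env(name: str, value: str) -> None:
    """Make a value visible to later workflow steps by appending
    NAME=value to the $GITHUB_ENV file. No-op outside Actions."""
    env_file = os.environ.get("GITHUB_ENV")
    if env_file:
        with open(env_file, "a") as f:
            f.write(f"{name}={value}\n")
```

For example, `set_workflow_env("BASELINE_UPDATED", "true")` after a successful baseline promotion.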
## Where MindStudio Fits In
The setup described above requires you to write and maintain the Python script, manage dependencies, and wire up each API integration yourself. That’s a reasonable tradeoff if you want full control. But there’s a faster path if you’d rather focus on the experiment logic than the infrastructure around it.
MindStudio’s Agent Skills Plugin — an npm SDK called `@mindstudio-ai/agent` — lets you call MindStudio’s 120+ typed capabilities directly from code running inside GitHub Actions. This means your experiment script can call `agent.sendEmail()`, `agent.searchGoogle()`, or `agent.runWorkflow()` without setting up separate API integrations for each service.
Here’s how you’d call a MindStudio workflow from an Actions step:
```yaml
- name: Run MindStudio AutoResearch workflow
  run: |
    npx @mindstudio-ai/agent run \
      --workflow "autoresearch-cold-email" \
      --input '{"baseline": "${{ env.CURRENT_BASELINE }}", "iteration": "${{ env.ITERATION }}"}'
  env:
    MINDSTUDIO_API_KEY: ${{ secrets.MINDSTUDIO_API_KEY }}
```
The workflow handles the AI generation and analysis. GitHub Actions handles the scheduling and state management. Each component does what it’s good at.
Alternatively, skip GitHub Actions entirely. MindStudio’s autonomous background agents run on a native schedule — you set the interval in the agent config, and MindStudio handles execution without an external scheduler. No YAML, no Python environment, no commit strategy required. You get 200+ AI models available out of the box and native integrations with email tools, Google Sheets, Airtable, and Slack already wired in.
For teams building their first experiment loop, the no-code path is often faster. For teams that want experiment logic version-controlled alongside other code, the hybrid approach — MindStudio as the AI reasoning layer, GitHub Actions as the orchestrator — tends to work well. You can try MindStudio free at mindstudio.ai.
## Common Mistakes and How to Avoid Them
### Running Experiments Too Frequently
Running a cold email loop every hour doesn’t make sense — you need at least 24 hours of data before drawing conclusions. Match your cron schedule to your data collection window. For email, daily variant generation and hourly result checks is a sensible cadence.
### Starting Without a Baseline
If the loop starts with no baseline, the LLM generates variants against nothing useful. Always set an explicit starting baseline in your initial state file — even if it’s your current working best guess. The loop will improve from there.
### Ignoring Statistical Significance

Picking a winner after a small sample produces false positives. Require both a minimum sample size (at least 100–200 sends per variant for email) and a minimum improvement percentage before updating the baseline. Python’s `scipy.stats` makes chi-square testing straightforward to add.
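If you’d rather run the significance check without adding scipy as a dependency, a two-proportion z-test on open counts needs only the standard library (for a 2×2 table, `scipy.stats.chi2_contingency` gives an equivalent answer):

```python
from math import erf, sqrt

def two_proportion_p(opens_a: int, sends_a: int, opens_b: int, sends_b: int) -> float:
    """Two-sided p-value for whether two open rates genuinely differ,
    via a pooled two-proportion z-test."""
    p_pool = (opens_a + opens_b) / (sends_a + sends_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return 1.0  # degenerate identical rates: no evidence of a difference
    z = (opens_a / sends_a - opens_b / sends_b) / se
    # Normal-CDF tail probability, doubled for a two-sided test
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

Requiring, say, `two_proportion_p(...) < 0.05` alongside the 10% improvement threshold keeps noise from promoting a false winner.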
### Missing the Concurrency Block

Without a `concurrency` configuration, GitHub may queue multiple scheduled runs when the repository is busy. The `concurrency` block shown in the workflow YAML earlier ensures runs execute sequentially rather than in parallel.
### Accidentally Logging Secrets

GitHub Actions masks registered secrets in run logs and scans for known secret formats, but that protection has gaps: a debug print of a raw API response can expose key fragments, and anything written into committed result files bypasses log masking entirely. Write result files to a dedicated subdirectory, review what’s being committed, and never log raw API responses in production.
## Frequently Asked Questions
### What is an AutoResearch experiment loop?
An AutoResearch loop is an automated cycle where AI both generates test variants and evaluates results — no human decision point between iterations. An LLM proposes alternatives for whatever is being optimized, the variants are deployed or tested, performance data is collected, and the best-performing variant becomes the new baseline. The process repeats on a schedule, compounding small improvements over time.
### Can GitHub Actions replace a dedicated server for scheduled experiments?
For most experiment loops, yes. GitHub Actions handles scheduling, compute, logging, and secrets management. The main constraint is that free-tier private repos get 2,000 minutes per month — public repos are unlimited. A loop that runs hourly and takes under 5 minutes sits well within the free tier. Scheduled runs can be delayed by a few minutes during GitHub’s peak periods, which is usually acceptable.
### How do I persist state between GitHub Actions runs without a database?
Commit a JSON state file back to your repository at the end of each run. This gives you version-controlled history and requires no external infrastructure. For larger datasets, use GitHub Artifacts or write to an external service like Airtable or Supabase. The commit approach is usually sufficient for experiment loops running at hourly or daily cadences.
### How do I run A/B tests on cold email automatically?
Connect your email platform’s API to your experiment script. Generate subject line or body copy variants with an LLM, create separate campaigns or sequences for each variant, track open and reply rates via the API, and update the baseline to the winning variant once sufficient data is collected. Schedule variant generation once per day and result analysis more frequently. Most major platforms — Instantly, Smartlead, Mailchimp — expose the necessary endpoints.
### How do I make sure the experiment doesn’t update the baseline on noise?

Two checks help here. First, require a minimum sample size before evaluating results — don’t compare variants until each has been sent to at least 100–200 recipients. Second, require a meaningful improvement threshold (10–15% better than baseline, not just 1%) before updating. For more rigorous validation, add a chi-square test or z-test using `scipy.stats` to confirm the difference is statistically significant rather than random variation.
### What’s the best way to handle multiple experiment types in a single repo?
Use a separate state file per experiment type (e.g., `results/email_state.json`, `results/landing_state.json`) and a single workflow that runs all of them in sequence or as separate jobs. Keep experiment-specific configuration in YAML or JSON config files rather than hardcoded in the script. This makes it easy to add or remove experiments without changing the core loop logic.
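A sketch of such a driver, with experiment definitions kept as data; the names and fields are illustrative:

```python
import json
from pathlib import Path

# Illustrative experiment registry; in practice this could live in a config file
EXPERIMENTS = [
    {"name": "email", "state_file": "results/email_state.json"},
    {"name": "landing", "state_file": "results/landing_state.json"},
]

def run_all(experiments: list, run_one) -> list:
    """Load each experiment's own state file, run one cycle via
    run_one(name, state) -> state, and write the state back."""
    ran = []
    for exp in experiments:
        path = Path(exp["state_file"])
        state = json.loads(path.read_text()) if path.exists() else {"iteration": 0}
        new_state = run_one(exp["name"], state)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(new_state, indent=2))
        ran.append(exp["name"])
    return ran
```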
## Key Takeaways
- GitHub Actions provides free, scheduled compute with built-in secrets management and logging — a practical foundation for AutoResearch loops without server overhead
- The core pattern is: generate variants with an LLM → deploy → collect results → analyze → update baseline → repeat
- Cold email, landing page copy, and AI prompt optimization are well-suited use cases where small, consistent improvements compound over time
- Commit experiment state to the repo as JSON — it’s free, versioned, and requires no external database
- Add a concurrency block, minimum sample size requirements, and a statistical significance check before any baseline update
- For a no-code path to the same loop, MindStudio’s background agents handle scheduling and execution natively, with 200+ AI models and pre-built integrations already included