# How to Use GitHub Actions to Run AutoResearch Experiments on a Schedule
Deploy an AutoResearch loop to GitHub Actions to run A/B experiments on cold email, landing pages, or AI skills automatically every hour without a server.
## Why GitHub Actions Is a Surprisingly Good Research Platform
Most teams that want to run automated experiments reach for a server, a cron job on a VM, or a paid scheduling tool. All of those work, but they add cost and maintenance overhead that’s hard to justify for an experiment loop that runs a few API calls every hour.
GitHub Actions changes that calculation. You get free compute on a schedule, version-controlled configuration, secrets management, and logging — all without provisioning anything. For running AutoResearch experiments, it’s a strong fit.
This guide walks through setting up a scheduled AutoResearch loop on GitHub Actions to run A/B experiments automatically — on cold email, landing page copy, or AI skill prompts — without maintaining a server.
## What an AutoResearch Experiment Loop Actually Is
An AutoResearch loop is a self-contained cycle where AI both generates test variants and evaluates results — no human checkpoint required between iterations. The basic cycle looks like this:
- Generate variants — An LLM proposes N alternatives for the thing being tested (a subject line, a headline, a prompt)
- Deploy or record the variant — The variant is sent, published, or logged for comparison
- Collect performance data — Metrics are pulled from an API (open rates, conversion rates, output quality scores)
- Analyze results — The loop scores each variant and identifies a winner
- Update the baseline — The winning variant becomes the new control
- Repeat on schedule
Traditional A/B testing requires a human to review results and decide what to test next. AutoResearch replaces that step with an AI judgment call based on the data collected since the last cycle.
This works well for three categories in particular:
- Cold email — Subject lines, opening sentences, and CTAs tested with real sends
- Landing pages — Headlines and value propositions rotated and measured against conversion data
- AI skill optimization — System prompts or few-shot examples benchmarked using an LLM-as-judge pattern
The AI that generates variants can also evaluate results and decide what direction to test next. Deployed on GitHub Actions with a schedule trigger, the whole thing runs continuously without you touching it.
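In code, the cycle above reduces to a small orchestration function. This is a minimal sketch, not a real library API: `generate`, `deploy`, and `collect` are stand-ins for whatever your platform integration provides.

```python
def run_cycle(state: dict, generate, deploy, collect) -> dict:
    """One AutoResearch iteration. The callables are platform-specific:
    generate(baseline) -> list of variant strings,
    deploy(variant) publishes or sends a variant,
    collect(variant) -> float score from the last cycle's metrics."""
    # 1. Score the variants deployed in the previous cycle
    scores = {v: collect(v) for v in state["active_variants"]}
    # 2. Promote a winner if it beats the current baseline
    if scores:
        best = max(scores, key=scores.get)
        if scores[best] > state["baseline_score"]:
            state["baseline"], state["baseline_score"] = best, scores[best]
    # 3. Generate and deploy the next batch against the (possibly new) baseline
    state["active_variants"] = generate(state["baseline"])
    for variant in state["active_variants"]:
        deploy(variant)
    state["iteration"] += 1
    return state
```

Keeping the stages injectable makes the loop easy to test with stubbed metrics before pointing it at a live email or analytics API.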
## What You’ll Need Before Starting
Before writing any YAML, confirm you have:
- A GitHub account with a repository for the experiment project
- An LLM API key (OpenAI, Anthropic, or your preferred provider) stored as a GitHub secret
- Access to the platform API for whatever you’re testing — your email tool, analytics provider, or a custom endpoint
- A storage strategy for results — a JSON file committed back to the repo is the simplest starting point; Airtable or Google Sheets works if you need dashboards
- Basic familiarity with YAML and either Python or Node.js
You don’t need a server, a Docker registry, or a database. GitHub Actions provides the compute, and the repo itself handles versioned storage.
## Setting Up the Scheduled Workflow

GitHub Actions workflows live in `.github/workflows/` in your repository. Create a file called `autoresearch.yml` there.
Here’s the base structure for a workflow that runs every hour:
```yaml
name: AutoResearch Experiment Loop

on:
  schedule:
    - cron: '0 * * * *'   # Top of every hour, UTC
  workflow_dispatch:       # Manual trigger for testing

# Needed so the results commit can push with the default GITHUB_TOKEN
permissions:
  contents: write

concurrency:
  group: autoresearch
  cancel-in-progress: false

jobs:
  run-experiment:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run experiment loop
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EMAIL_API_KEY: ${{ secrets.EMAIL_API_KEY }}
        run: python experiment.py

      - name: Commit results
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add results/
          git diff --staged --quiet || git commit -m "AutoResearch: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
          git push
```
A few things worth noting:
- `workflow_dispatch` adds a manual trigger button in the GitHub UI — useful for testing before the schedule kicks in
- `timeout-minutes: 30` prevents runaway jobs from consuming your free-tier minutes
- The commit step only fires when there are actual file changes, avoiding empty commits
- The `concurrency` block ensures a new run waits for the current one to finish rather than running in parallel
### Configuring Secrets

Never hardcode API keys in your workflow file. In your repository, go to Settings → Secrets and variables → Actions and add each key your script needs. Reference them in the YAML as `${{ secrets.YOUR_KEY_NAME }}`.
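Inside the script, a small guard that fails fast with a pointer to the settings page saves debugging time when a secret was never configured. `require_env` is an illustrative helper, not part of any SDK:

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, failing loudly if the
    corresponding GitHub secret was never configured."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing env var {name}; add it under "
            "Settings → Secrets and variables → Actions"
        )
    return value
```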
### Choosing Your Schedule
GitHub Actions uses standard cron syntax, always in UTC:
| Frequency | Cron expression |
|---|---|
| Every hour | `0 * * * *` |
| Every 6 hours | `0 */6 * * *` |
| Daily at 9am UTC | `0 9 * * *` |
| Weekdays at 8am UTC | `0 8 * * 1-5` |
Note: GitHub may delay scheduled runs by a few minutes during high-traffic periods. This is fine for experiment loops but rules out anything requiring precise timing.
## Building the Experiment Loop Script

The workflow above calls `experiment.py`. Here’s the structure that script needs.
### Load Current State
```python
import json
import os
from pathlib import Path

STATE_FILE = Path("results/state.json")

def load_state():
    """Return persisted experiment state, or a fresh state on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {
        "baseline": None,
        "baseline_score": 0,
        "active_variants": [],
        "results": [],
        "iteration": 0
    }
```
The state file tracks what’s been tested and what the current baseline is. Committing it to the repo means state persists across runs — no external database required.
### Generate New Variants
```python
import openai

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_variants(baseline: str, recent_results: str, n: int = 3) -> list:
    """Ask the LLM for n new test variants, informed by recent performance."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a conversion optimization expert. Propose test variants."
            },
            {
                "role": "user",
                "content": f"""
Current baseline: {baseline}
Recent performance data: {recent_results}

Generate {n} alternative variants to test. Return as JSON: {{"variants": [...]}}
"""
            }
        ],
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("variants", [])
```
Passing recent results back to the LLM is important — it lets the model make informed proposals rather than guessing. After several iterations, the model builds implicit knowledge of what directions work for your specific context.
### Collect and Score Results
```python
def collect_and_score(state: dict) -> dict:
    """Pull metrics for variants deployed in the previous cycle."""
    scores = {}
    for variant in state.get("active_variants", []):
        if variant["type"] == "email_subject":
            # get_email_metrics is your own wrapper around the email
            # platform's reporting API
            metrics = get_email_metrics(variant["deployment_id"])
            # Weighted score: open rate matters more than click rate for cold email
            scores[variant["id"]] = {
                "text": variant["text"],
                "score": metrics["open_rate"] * 0.6 + metrics["click_rate"] * 0.4,
                "metrics": metrics
            }
    return scores
```
### Update the Baseline
```python
def update_baseline(state: dict, scores: dict) -> tuple[dict, bool]:
    updated = False
    if not scores:
        return state, updated
    best_id = max(scores, key=lambda x: scores[x]["score"])
    best_score = scores[best_id]["score"]
    # Only update if there's a meaningful improvement and sufficient sample size
    if best_score > state["baseline_score"] * 1.10:  # At least 10% better
        state["baseline"] = scores[best_id]["text"]
        state["baseline_score"] = best_score
        updated = True
    state["results"].append({
        "iteration": state["iteration"],
        "scores": scores,
        "baseline_updated": updated
    })
    return state, updated
```
The 10% improvement threshold helps avoid updating the baseline on noise. Combine this with a minimum sample size check (described in the common mistakes section) for more reliable results.
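A minimal sample-size gate to pair with that threshold, assuming each entry’s `metrics` dict carries a `sends` count (the field name is hypothetical; use whatever your platform returns):

```python
MIN_SENDS = 150  # per-variant floor before rate metrics are trusted

def has_sufficient_sample(scores: dict, min_sends: int = MIN_SENDS) -> bool:
    """True only when every scored variant has enough observations
    for its rate metrics to be meaningful."""
    return bool(scores) and all(
        s["metrics"].get("sends", 0) >= min_sends for s in scores.values()
    )
```

Call this before `update_baseline` and skip the update entirely when it returns False.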
## Three Practical Use Cases
### Cold Email Subject Line Testing

Cold email is one of the strongest use cases for an AutoResearch loop. A consistent 5-point improvement in open rate, compounded over weeks, can meaningfully change your pipeline. The loop works as follows:
- Use your current best-performing subject line as the baseline
- Generate three new variants targeting different angles — curiosity, specificity, and direct benefit
- Create campaigns in your email platform with each variant via API
- Pull open and reply rates after 24–48 hours
- Update the baseline if a variant beats the threshold
For email, run the variant deployment once per day and the analysis step every hour. This gives you enough data before generating new variants.
### Landing Page Copy Optimization
For landing pages, the loop writes variant headlines or value propositions to a config file that your site reads. Combined with a conversion analytics API (GA4, Plausible, or PostHog), you can pull conversion rates per variant and rotate toward the best performer automatically.
A clean architecture: keep your variant config as a JSON file in the same repo. Your site reads from a CDN-hosted copy of that file. The loop updates the file, commits it, and your CDN serves the new variant — no CMS required.
### AI Skill Prompt Optimization
This use case is valuable if you’re building AI agents. The loop tests different system prompts, few-shot examples, or output formats using an LLM-as-judge approach:
- Generate N variants of a system prompt
- Run each variant against a fixed test set of 10–20 representative inputs
- Use a separate LLM call to score each output against defined quality criteria
- Commit the highest-scoring prompt as the new baseline
The result is a self-improving agent whose prompt gets sharper with every cycle. Paired with MindStudio’s autonomous background agents, this kind of iterative prompt tuning can run indefinitely without manual review.
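A sketch of that judge, with the model call injected as a callable so the scoring logic stays testable. `ask(prompt) -> str` is a stand-in for your LLM client; the rubric wording is illustrative:

```python
def judge_output(variant_output: str, criteria: str, ask) -> float:
    """Score one output 0-10 against quality criteria using a separate
    LLM call. `ask(prompt) -> str` wraps whatever client you use."""
    reply = ask(
        "Score this output from 0 to 10 against the criteria.\n"
        f"Criteria: {criteria}\nOutput: {variant_output}\n"
        "Reply with the number only."
    )
    try:
        # Clamp to the rubric's range in case the judge drifts
        return max(0.0, min(10.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable judge reply counts as a failure

def score_variant(outputs: list, criteria: str, ask) -> float:
    """Average judge score across the fixed test set's outputs."""
    return sum(judge_output(o, criteria, ask) for o in outputs) / len(outputs)
```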
## Storing and Accessing Results
### Commit to the Repo
Committing results as JSON to your repository is the simplest strategy. It gives you version-controlled history, no cost, and an audit trail in the commit log. For most teams running 1–5 experiment loops, this is enough.
### Write to External Storage
If you need queryable data or dashboards, write results to Airtable, Google Sheets, or Supabase via their APIs. All three have free tiers and are easy to connect from a Python script.
### Notifications
Add a Slack notification step that fires when a new baseline is found:
```yaml
- name: Notify on baseline update
  if: env.BASELINE_UPDATED == 'true'
  uses: 8398a7/action-slack@v3
  with:
    status: custom
    custom_payload: |
      {"text": "New baseline: ${{ env.NEW_BASELINE }}"}
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```
Set the `BASELINE_UPDATED` environment variable from your Python script by appending a `NAME=value` line to the file at `$GITHUB_ENV`, so later workflow steps can read it.
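In Python, that amounts to appending to the file whose path Actions provides in the `GITHUB_ENV` variable; the helper name here is mine, not part of any SDK:

```python
import os

def set_workflow_env(name: str, value: str) -> None:
    """Make a value visible to later workflow steps by appending
    NAME=value to the $GITHUB_ENV file. No-op outside Actions."""
    env_file = os.environ.get("GITHUB_ENV")
    if env_file:
        with open(env_file, "a") as f:
            f.write(f"{name}={value}\n")
```

For example, `set_workflow_env("BASELINE_UPDATED", "true")` after a successful baseline promotion.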
## Where MindStudio Fits In
The setup described above requires you to write and maintain the Python script, manage dependencies, and wire up each API integration yourself. That’s a reasonable tradeoff if you want full control. But there’s a faster path if you’d rather focus on the experiment logic than the infrastructure around it.
MindStudio’s Agent Skills Plugin — an npm SDK called `@mindstudio-ai/agent` — lets you call MindStudio’s 120+ typed capabilities directly from code running inside GitHub Actions. This means your experiment script can call `agent.sendEmail()`, `agent.searchGoogle()`, or `agent.runWorkflow()` without setting up separate API integrations for each service.
Here’s how you’d call a MindStudio workflow from an Actions step:
```yaml
- name: Run MindStudio AutoResearch workflow
  run: |
    npx @mindstudio-ai/agent run \
      --workflow "autoresearch-cold-email" \
      --input '{"baseline": "${{ env.CURRENT_BASELINE }}", "iteration": "${{ env.ITERATION }}"}'
  env:
    MINDSTUDIO_API_KEY: ${{ secrets.MINDSTUDIO_API_KEY }}
```
The workflow handles the AI generation and analysis. GitHub Actions handles the scheduling and state management. Each component does what it’s good at.
Alternatively, skip GitHub Actions entirely. MindStudio’s autonomous background agents run on a native schedule — you set the interval in the agent config, and MindStudio handles execution without an external scheduler. No YAML, no Python environment, no commit strategy required. You get 200+ AI models available out of the box and native integrations with email tools, Google Sheets, Airtable, and Slack already wired in.
For teams building their first experiment loop, the no-code path is often faster. For teams that want experiment logic version-controlled alongside other code, the hybrid approach — MindStudio as the AI reasoning layer, GitHub Actions as the orchestrator — tends to work well. You can try MindStudio free at mindstudio.ai.
## Common Mistakes and How to Avoid Them
### Running Experiments Too Frequently
Running a cold email loop every hour doesn’t make sense — you need at least 24 hours of data before drawing conclusions. Match your cron schedule to your data collection window. For email, daily variant generation and hourly result checks is a sensible cadence.
### Starting Without a Baseline
If the loop starts with no baseline, the LLM generates variants against nothing useful. Always set an explicit starting baseline in your initial state file — even if it’s your current working best guess. The loop will improve from there.
### Ignoring Statistical Significance

Picking a winner after a small sample produces false positives. Require both a minimum sample size (at least 100–200 sends per variant for email) and a minimum improvement percentage before updating the baseline. Python’s `scipy.stats` makes chi-square testing straightforward to add.
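If you’d rather run the significance check without adding scipy as a dependency, a two-proportion z-test on open counts needs only the standard library (for a 2×2 table, `scipy.stats.chi2_contingency` gives an equivalent answer):

```python
from math import erf, sqrt

def two_proportion_p(opens_a: int, sends_a: int, opens_b: int, sends_b: int) -> float:
    """Two-sided p-value for whether two open rates genuinely differ,
    via a pooled two-proportion z-test."""
    p_pool = (opens_a + opens_b) / (sends_a + sends_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return 1.0  # degenerate identical rates: no evidence of a difference
    z = (opens_a / sends_a - opens_b / sends_b) / se
    # Normal-CDF tail probability, doubled for a two-sided test
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

Requiring, say, `two_proportion_p(...) < 0.05` alongside the 10% improvement threshold keeps noise from promoting a false winner.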
### Missing the Concurrency Block

Without a `concurrency` configuration, GitHub may queue multiple scheduled runs when the repository is busy. The `concurrency` block shown in the workflow YAML earlier ensures runs execute sequentially rather than in parallel.
### Accidentally Logging Secrets

GitHub Actions masks registered secrets in run logs and scans for known secret formats, but that protection has gaps: a debug print of a raw API response can expose key fragments, and anything written into committed result files bypasses log masking entirely. Write result files to a dedicated subdirectory, review what’s being committed, and never log raw API responses in production.
## Frequently Asked Questions
### What is an AutoResearch experiment loop?
An AutoResearch loop is an automated cycle where AI both generates test variants and evaluates results — no human decision point between iterations. An LLM proposes alternatives for whatever is being optimized, the variants are deployed or tested, performance data is collected, and the best-performing variant becomes the new baseline. The process repeats on a schedule, compounding small improvements over time.
### Can GitHub Actions replace a dedicated server for scheduled experiments?
For most experiment loops, yes. GitHub Actions handles scheduling, compute, logging, and secrets management. The main constraint is that free-tier private repos get 2,000 minutes per month — public repos are unlimited. A loop that runs hourly and takes under 5 minutes sits well within the free tier. Scheduled runs can be delayed by a few minutes during GitHub’s peak periods, which is usually acceptable.
### How do I persist state between GitHub Actions runs without a database?
Commit a JSON state file back to your repository at the end of each run. This gives you version-controlled history and requires no external infrastructure. For larger datasets, use GitHub Artifacts or write to an external service like Airtable or Supabase. The commit approach is usually sufficient for experiment loops running at hourly or daily cadences.
### How do I run A/B tests on cold email automatically?
Connect your email platform’s API to your experiment script. Generate subject line or body copy variants with an LLM, create separate campaigns or sequences for each variant, track open and reply rates via the API, and update the baseline to the winning variant once sufficient data is collected. Schedule variant generation once per day and result analysis more frequently. Most major platforms — Instantly, Smartlead, Mailchimp — expose the necessary endpoints.
### How do I make sure the experiment doesn’t update the baseline on noise?

Two checks help here. First, require a minimum sample size before evaluating results — don’t compare variants until each has been sent to at least 100–200 recipients. Second, require a meaningful improvement threshold (10–15% better than baseline, not just 1%) before updating. For more rigorous validation, add a chi-square test or z-test using `scipy.stats` to confirm the difference is statistically significant rather than random variation.
### What’s the best way to handle multiple experiment types in a single repo?
Use a separate state file per experiment type (e.g., `results/email_state.json`, `results/landing_state.json`) and a single workflow that runs all of them in sequence or as separate jobs. Keep experiment-specific configuration in YAML or JSON config files rather than hardcoded in the script. This makes it easy to add or remove experiments without changing the core loop logic.
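A sketch of such a driver, with experiment definitions kept as data; the names and fields are illustrative:

```python
import json
from pathlib import Path

# Illustrative experiment registry; in practice this could live in a config file
EXPERIMENTS = [
    {"name": "email", "state_file": "results/email_state.json"},
    {"name": "landing", "state_file": "results/landing_state.json"},
]

def run_all(experiments: list, run_one) -> list:
    """Load each experiment's own state file, run one cycle via
    run_one(name, state) -> state, and write the state back."""
    ran = []
    for exp in experiments:
        path = Path(exp["state_file"])
        state = json.loads(path.read_text()) if path.exists() else {"iteration": 0}
        new_state = run_one(exp["name"], state)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(new_state, indent=2))
        ran.append(exp["name"])
    return ran
```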
## Key Takeaways
- GitHub Actions provides free, scheduled compute with built-in secrets management and logging — a practical foundation for AutoResearch loops without server overhead
- The core pattern is: generate variants with an LLM → deploy → collect results → analyze → update baseline → repeat
- Cold email, landing page copy, and AI prompt optimization are well-suited use cases where small, consistent improvements compound over time
- Commit experiment state to the repo as JSON — it’s free, versioned, and requires no external database
- Add a concurrency block, minimum sample size requirements, and a statistical significance check before any baseline update
- For a no-code path to the same loop, MindStudio’s background agents handle scheduling and execution natively, with 200+ AI models and pre-built integrations already included