
What Is the Remote Labor Index? Why AI Agents Complete Only 2.5% of Real Freelance Work

Scale AI's Remote Labor Index tested frontier agents on 240 Upwork projects. The 97.5% failure rate reveals the gap between task execution and real jobs.

MindStudio Team

The Benchmark That Put AI Agents to the Test

A 2.5% success rate. That’s what frontier AI agents achieved when Scale AI tested them on 240 real projects sourced from Upwork.

The Remote Labor Index — Scale AI’s benchmark for measuring AI agent performance on genuine freelance work — cuts through the usual benchmarking noise. Instead of synthetic test sets or curated academic problems, it uses actual jobs that real clients paid real humans to complete.

The result is an uncomfortable number. While top AI models routinely score 80–90% on coding benchmarks and claim near-human performance on standardized tests, they can barely handle 1 in 40 genuine freelance tasks end-to-end.

Understanding why that gap exists matters — not just for AI researchers, but for anyone building or deploying AI agents today.


What the Remote Labor Index Actually Measures

The Remote Labor Index (RLI) was developed by Scale AI to answer a simple but hard question: can AI agents do the kind of work that humans currently get paid to do?

The benchmark uses real Upwork job postings as its test environment. Scale AI pulled 240 projects spanning categories like software development, data analysis, writing, and web research. These weren’t made-up exercises — they were jobs that freelancers actually took, completed, and got paid for.

How the Tasks Were Selected

The 240 tasks cover a range of real freelance work:

  • Software development — Writing, debugging, and extending code for real projects
  • Data analysis and processing — Cleaning datasets, running analyses, producing reports
  • Web research — Finding, aggregating, and synthesizing information from across the web
  • Writing and editing — Creating content, summarizing documents, drafting communications
  • File and format manipulation — Tasks requiring structured output in specific formats

These aren’t cherry-picked to be either easy or hard. They represent the actual distribution of work that flows through a platform like Upwork on a given day.

How Success Is Defined

This is where the RLI differs most from typical benchmarks. Success means completing the task to a level that a client would actually accept — not getting 80% of the way there, not producing something that looks roughly correct, but actually finishing the job.

Evaluation uses reference outputs from human freelancers as gold standards. Automated scoring checks whether the agent’s output matches the expected deliverable in meaningful ways. Partial credit isn’t the point. Either the job is done or it isn’t.

That binary framing is harsh. It’s also realistic. A client hiring on Upwork doesn’t pay half the rate for half the work.


What a 2.5% Success Rate Actually Means

Let’s be precise: 2.5% means frontier AI agents — the best models available — successfully completed just 6 of the 240 tasks.

That number will surprise anyone who follows AI model announcements. GPT-4-class models score above 85% on HumanEval (a standard code generation benchmark). Leading models approach or exceed human-level performance on dozens of academic evaluations. These aren’t weak models.

The Benchmark Gap

The contrast between controlled benchmark performance and real-world task completion isn’t new. AI systems have consistently performed better on structured tests than on open-ended problems. But the RLI makes that gap concrete and hard to dismiss.

There are a few reasons controlled benchmarks inflate expectations:

  • Narrow scope — A benchmark like HumanEval gives a model a well-defined function to write. A real coding task might involve understanding an existing codebase, interpreting vague requirements, running code to check for bugs, and communicating about edge cases.
  • Single-turn evaluation — Most benchmarks test one-shot performance. Real work is iterative.
  • Clean inputs — Benchmarks provide precise, unambiguous instructions. Upwork tasks come with the kind of instructions that humans give humans — casual, sometimes incomplete, often assuming shared context.

Why 97.5% Failure Isn’t Surprising

Completing a freelance task end-to-end requires threading together many capabilities simultaneously: understanding the ask, planning a sequence of steps, using the right tools, handling unexpected situations, producing a deliverable in the right format, and verifying that it actually works.

Fail at any one of those steps, and the task fails.

There’s a straightforward way to see why this matters. If an agent has a 90% success rate at each of 10 sequential steps, the probability of completing all 10 correctly is 0.9^10 — about 35%. For a more complex task with 20 steps, that probability drops to roughly 12%, even with individually strong performance at each step.
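The arithmetic is easy to check. A minimal Python sketch, where the 90% per-step rate and the step counts are the illustrative figures from this paragraph, not measured RLI values:

```python
# Compounding per-step failure: even strong per-step reliability
# collapses over a long sequence of dependent steps.
# (Illustrative numbers, not RLI data.)

def task_success_rate(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent sequential steps succeeds."""
    return per_step ** steps

print(round(task_success_rate(0.9, 10), 3))  # 0.349 for a 10-step task
print(round(task_success_rate(0.9, 20), 3))  # 0.122 for a 20-step task
```

Doubling the number of steps doesn't halve the success rate; it squares it, which is why long task chains fail far more often than per-step quality suggests.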

The RLI captures that compounding failure. Most other benchmarks don’t.


The Specific Reasons AI Agents Fail at Real Work

The 2.5% figure isn’t random. Understanding where agents break down is more useful than the headline number.

Instruction Ambiguity

Real client instructions are messy. “Build me a scraper for this website” leaves open which data, in what format, with what error handling, and whether JavaScript-rendered content needs to be supported.

Humans handle ambiguity by asking clarifying questions or making reasonable inferences based on context. Current AI agents tend to either make overconfident assumptions or ask too many clarifying questions. Neither works well in practice, and the RLI’s evaluation doesn’t reward attempts — only results.

Multi-Step Execution Without Guardrails

Most real tasks require an agent to plan and execute a sequence of steps, often with dependencies between them. If step 3 fails or produces an unexpected output, steps 4 through 12 may be built on flawed foundations.

Autonomous agents need not just to execute individual steps but to detect when something has gone wrong and correct course. That error-detection and recovery capability is still weak in most current systems.
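As a sketch of what that guardrail can look like in code (all names here are hypothetical; this illustrates the pattern, not any real agent framework), a runner can validate every intermediate result before the next step consumes it:

```python
# Hypothetical sketch: run sequential steps, validating each result
# before the next step builds on it. Without the validate() check,
# a bad result at step 3 silently poisons steps 4 onward.
from typing import Any, Callable

Step = tuple[Callable[[Any], Any], Callable[[Any], bool]]  # (run, validate)

def run_with_guardrails(steps: list[Step], state: Any) -> Any:
    for i, (run, validate) in enumerate(steps, start=1):
        state = run(state)
        if not validate(state):
            # Fail loudly here instead of letting later steps
            # execute on a flawed foundation.
            raise RuntimeError(f"step {i} produced an invalid result")
    return state
```

A scraping workflow built this way would confirm that step 1 actually returned rows before step 2 tries to clean them, turning a silent downstream failure into an immediate, recoverable one.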

Tool Use at Scale

A freelance task might require an agent to browse websites, write and run code, read and write files, call external APIs, and format output — all within a single job. Using one tool in isolation is something frontier models handle reasonably well. Orchestrating multiple tools across a multi-step workflow is significantly harder.

The failure modes here aren’t always obvious. An agent might correctly use each tool individually but chain them in the wrong order, fail to pass outputs between them cleanly, or hit rate limits and error states it can’t handle gracefully.
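One of those failure modes, a transient error mid-chain sinking the whole job, has a simple mitigation. A hypothetical sketch (the tool functions here are placeholders, not a real API):

```python
# Hypothetical sketch: chain two tools, passing tool 1's output to
# tool 2, and retry transient failures (e.g. rate limits) with
# exponential backoff instead of letting one error fail the task.
import time

def call_with_retry(tool, payload, retries=3, backoff=0.01):
    for attempt in range(retries):
        try:
            return tool(payload)
        except RuntimeError:  # stand-in for a rate-limit / transient error
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

def fetch_then_summarize(fetch, summarize, url):
    page = call_with_retry(fetch, url)        # tool 1
    return call_with_retry(summarize, page)   # tool 2, fed tool 1's output
```

The explicit hand-off from tool 1 to tool 2 is the point: when the chain is wired in code rather than improvised by the agent, wrong ordering and dropped outputs stop being possible failure modes.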

Real Environments Are Unpredictable

Controlled benchmarks use stable, predictable environments. Real work happens in the messy real world: websites change structure, APIs return unexpected errors, files arrive in nonstandard formats, and edge cases appear constantly.

Agents trained and evaluated in controlled settings often lack the resilience to handle situations they haven’t encountered before. And because real tasks have long execution paths, even one unexpected environment state can derail the entire job.

Output Verification

Humans naturally sanity-check their work. A developer runs their code and checks the output. A researcher reads the report they’ve written and catches obvious errors before submitting.

AI agents often lack robust self-verification loops. They generate outputs and return them without confirming that those outputs actually satisfy the task requirements. The RLI’s strict evaluation criteria surface this problem directly.
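A minimal self-verification loop, with hypothetical names that sketch the pattern rather than any specific system, looks like this:

```python
# Hypothetical sketch: generate an output, check it against the task's
# acceptance criteria, and retry instead of returning unverified work.

def produce_verified(generate, verify, max_attempts: int = 3):
    """Return the first output that passes verify(); raise if none does."""
    for attempt in range(max_attempts):
        output = generate(attempt)
        if verify(output):
            return output
    raise RuntimeError("no output passed verification")
```

For a coding task, verify() might run the generated code against a test; for a report, it might check that required sections exist. The key design choice is that unverified output is never returned as a finished deliverable.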


What the RLI Tells Us About Where AI Agents Actually Stand

The Remote Labor Index isn’t an argument that AI agents are useless. It’s an argument for a more accurate map of what they can and can’t do right now.

Narrow Tasks vs. Complete Jobs

AI agents are genuinely good at bounded, well-defined subtasks. Write a function that does X. Summarize this document. Extract structured data from this text. Generate a first draft from this brief. These are things current models handle reliably.

Complete jobs are different. They require stringing many subtasks together correctly, handling ambiguity throughout, and producing a finished deliverable — not just progress toward one.

That distinction matters because a lot of AI deployment advice conflates the two. “AI can replace your freelancer” is usually wrong. “AI can handle specific, well-defined parts of your workflow” is often right.

Benchmark Scores vs. Real Capability

The RLI also highlights a broader issue with how AI progress is measured. When models are fine-tuned or prompted to perform well on specific benchmarks, those scores stop being reliable indicators of general capability.

Real-world performance on diverse, open-ended tasks is much harder to optimize for. The RLI’s design — using actual freelance work as the test set — makes it resistant to the kind of narrow optimization that inflates standard benchmark scores. That’s part of what makes it a more honest signal about where the technology actually is.

The Automation Ceiling

For enterprise teams thinking about AI automation, the RLI is a useful calibration. It doesn’t mean automation is impossible — it means automation works best when tasks are narrow, well-defined, and structured.

The more a task looks like a real open-ended job (ambiguous, multi-step, requiring judgment calls), the less likely current AI agents are to complete it autonomously. The more it looks like a repeatable, well-specified workflow, the more likely automation will succeed reliably.


How MindStudio Approaches the Problem

The RLI’s findings clarify something important about AI agent design: success rates go up dramatically when agents operate within defined workflows rather than trying to interpret open-ended, ambiguous jobs.

This is exactly where MindStudio is designed to work. Instead of deploying a general-purpose AI agent and hoping it figures out a complex, unstructured task, MindStudio lets teams build purpose-built agents for specific, well-understood workflows.

Purpose-Built vs. General-Purpose Agents

A general AI agent given “scrape this data, clean it, and produce a report” faces all the failure modes the RLI documents: ambiguous scope, multi-step chaining, tool orchestration failures, weak output verification.

A MindStudio agent built specifically for that workflow is different. The steps are defined explicitly. The tools are wired up in advance. The output format is specified. The agent’s job is to execute reliably within a known structure — not to figure out what structure to use from scratch.
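To make that concrete, here is an illustrative spec in plain Python. This is not MindStudio's actual configuration format; it only shows how pre-defining steps and tool wiring removes the interpretation work a general-purpose agent would otherwise have to do:

```python
# Illustrative only, NOT MindStudio's real configuration format.
# A declarative workflow: steps, tools, and data flow are fixed in
# advance, so the agent executes a known structure rather than
# inventing one from an ambiguous brief.

WORKFLOW = {
    "name": "scrape-clean-report",
    "steps": [
        {"id": "scrape", "tool": "web_scraper", "output": "raw_rows"},
        {"id": "clean", "tool": "dedupe_and_normalize",
         "input": "raw_rows", "output": "clean_rows"},
        {"id": "report", "tool": "render_report",
         "input": "clean_rows", "output": "report_pdf"},
    ],
}

def execution_order(workflow: dict) -> list[str]:
    """The order comes from the spec, not from agent inference."""
    return [step["id"] for step in workflow["steps"]]
```

Everything a general agent would have to guess at (which tools, in what order, passing which outputs) is stated up front, which is exactly where the RLI's documented failure modes disappear.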

That’s not a workaround for AI limitations. It’s an accurate understanding of where AI agents deliver real value, and designing accordingly.

Workflow-First Agent Design

MindStudio’s visual workflow builder lets teams map out multi-step processes explicitly — the same kind of multi-step execution that causes AI agents to fail when left to their own devices. Each step is defined, each tool connection is configured, and the workflow is tested before it runs in production.

With 200+ AI models available out of the box and 1,000+ pre-built integrations with tools like HubSpot, Salesforce, Google Workspace, and Airtable, teams can route the right model to the right step rather than relying on a single general-purpose agent to handle everything.

The result is agents that work reliably — not because they’re more powerful than frontier models, but because they’re deployed against tasks they’re actually capable of handling, with structure that prevents the compounding failures the RLI documents.

You can build and test a MindStudio agent free at mindstudio.ai. Most agents take between 15 minutes and an hour to build.


Frequently Asked Questions

What is the Remote Labor Index?

The Remote Labor Index (RLI) is a benchmark created by Scale AI that measures how well AI agents perform on real freelance tasks. It uses 240 actual projects sourced from Upwork — spanning software development, data analysis, writing, and web research — and evaluates whether AI agents can complete those tasks to a standard a paying client would accept. It’s one of the few AI benchmarks built around real-world work rather than synthetic test cases.

Why did AI agents only complete 2.5% of tasks successfully?

The 2.5% completion rate reflects how real work differs from controlled benchmark conditions. Freelance tasks require multi-step execution, handling ambiguous instructions, orchestrating multiple tools in sequence, and producing verified deliverables — all without human intervention or correction. Each of those requirements introduces failure points. When they compound across a long task, overall success rates collapse even when individual capabilities are strong.

How is the Remote Labor Index different from standard AI benchmarks?

Most benchmarks test narrow, single-turn capabilities: write a function, answer a question, summarize a passage. The RLI tests end-to-end task completion against work that human freelancers actually completed. Success is binary — the job is either done or it isn’t. This framing makes the benchmark harder to game through narrow fine-tuning and more representative of how agents perform in practice.

Does a 2.5% success rate mean AI agents aren’t useful?

No. The RLI tests open-ended, general-purpose task completion on ambiguous real-world jobs. AI agents perform substantially better when deployed against specific, well-defined workflows rather than arbitrary freelance tasks. The finding is a prompt to match agent design to task structure — not an argument against AI automation broadly.

What kinds of tasks do AI agents handle reliably?

AI agents perform best on tasks with clear, narrow scope: generating code from precise specifications, extracting structured data from consistent sources, summarizing documents, drafting communications from templates, and processing structured data at scale. The pattern is consistent: constrained, repeatable, well-specified tasks are where AI automation works reliably. Open-ended, ambiguous, multi-step jobs with unpredictable environments are where it still breaks down.

What does the Remote Labor Index mean for enterprise AI deployment?

For enterprise teams, the RLI reinforces the importance of task design before agent deployment. Rather than assuming a general AI agent can handle complex, open-ended jobs, effective automation starts with breaking work into structured, repeatable steps with explicit success criteria. The right architecture — purpose-built agents with defined workflows — performs far better than deploying general models against vague instructions.


Key Takeaways

  • Scale AI’s Remote Labor Index tested frontier AI agents on 240 real Upwork projects and found a 2.5% success rate — roughly 6 completed tasks out of 240.
  • The failure rate reflects how real freelance work differs from controlled benchmarks: it’s multi-step, ambiguous, tool-dependent, and requires end-to-end execution without guardrails.
  • The gap between benchmark performance (often 80–90%+) and real-world task completion (2.5%) shows that benchmark scores alone are poor predictors of agent reliability in practice.
  • Compounding failure across sequential steps explains much of the gap — even small per-step failure rates produce low overall task completion rates.
  • AI agents work best when deployed against specific, well-defined workflows — not open-ended jobs that require interpreting ambiguous instructions and recovering from unexpected states.
  • Purpose-built agents with explicit workflow structure outperform general agents on real work by reducing the compounding failure risk that comes with long, ambiguous task chains.

The lesson from the Remote Labor Index is clear: define the workflow first, then deploy the agent. MindStudio makes that process fast — most purpose-built agents take 15 minutes to an hour to build — and gives teams access to 200+ AI models and 1,000+ integrations without writing code. Try it free at mindstudio.ai.
