# How to Optimize Web Scraping Skills for AI Agents: 6 Token-Saving Techniques
Learn how to reduce token usage by 90% in web scraping AI agent skills by filtering HTML, hardcoding selectors, batching requests, and using incremental runs.
## Why Web Scraping Drains Your AI Agent’s Token Budget
Web scraping is one of the most common skills you’ll add to an AI agent. Whether you’re building a price monitor, a research assistant, or a competitive intelligence tool, the agent needs to pull data from the web. But raw web scraping — passing unfiltered HTML directly to a language model — is one of the fastest ways to burn through your token budget.
A typical product page contains 50,000–150,000 characters of HTML. After tokenization, that’s 12,000–40,000 tokens for a single page. Multiply that by 100 pages per run and you’re spending significant money on input tokens before the model does any meaningful reasoning.
The problem is that most of that HTML is noise. Navigation bars, cookie banners, script tags, style blocks, and ad placeholders account for 70–90% of most pages — and none of it helps the model extract the data you care about.
This guide covers six techniques to reduce token usage in web scraping AI agent workflows by 80–95%, without sacrificing data quality.
## The Signal-to-Noise Problem
When an AI agent fetches a web page, the raw HTML typically contains:
- **Scripts and styles:** `<script>` blocks (JavaScript the model can’t run) and `<style>` blocks (CSS that’s irrelevant to extraction)
- **Navigation:** Headers, footers, sidebars, and menus — repeated across every page of the site
- **Tracking code:** Analytics pixels, ad tags, and third-party embeds
- **Structural markup:** Dozens of nested `<div>` elements that exist for layout, not content
- **Actual content:** The article text, product details, or data you need — often just 5–15% of total tokens
Optimizing web scraping in AI agents is largely about collapsing that ratio. The six techniques below each attack a different part of the problem.
## Technique 1: Filter HTML Before It Reaches the Model
The highest-impact change you can make is stripping unnecessary elements from the HTML before passing anything to the language model. This alone typically cuts token usage by 50–70%.
### What to remove

At minimum, strip:

- All `<script>` and `<style>` blocks
- `<nav>`, `<header>`, `<footer>`, and `<aside>` elements
- HTML comments (`<!-- ... -->`)
- Tracking and ad-related elements
- Redundant attributes: `class`, `id`, `style`, `data-*`, and `aria-*` (unless needed for selection)
### Implementation in Python

```python
from bs4 import BeautifulSoup, Comment

def filter_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, 'html.parser')

    # Remove noise elements
    for tag in soup.find_all(['script', 'style', 'nav', 'header', 'footer', 'aside']):
        tag.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Strip verbose attributes, covering all data-* and aria-* variants
    for tag in soup.find_all(True):
        tag.attrs = {
            attr: value for attr, value in tag.attrs.items()
            if attr not in ('class', 'id', 'style')
            and not attr.startswith(('data-', 'aria-'))
        }

    return str(soup)
```
For most pages, this reduces token count by 50–70% with no change to the underlying information the model receives. Combined with the next technique, you can usually get above 85%.
## Technique 2: Hardcode Selectors for Known Page Structures
If your agent scrapes the same site repeatedly — a competitor’s pricing page, a product catalog, a news feed — don’t ask the model to locate the data. Tell it exactly where to look.
CSS selectors and XPath expressions extract specific elements without the model ever seeing surrounding HTML. The model receives only the targeted content.
### Example: extracting product data
Instead of sending 20,000 tokens of HTML and prompting “find the price on this page,” extract directly:
```python
from bs4 import BeautifulSoup

def extract_product_data(html: str, selectors: dict) -> dict:
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for field, selector in selectors.items():
        element = soup.select_one(selector)
        result[field] = element.get_text(strip=True) if element else None
    return result

# Configuration
selectors = {
    "title": "h1.product-title",
    "price": ".price-block .current-price",
    "description": ".product-description p:first-child"
}

data = extract_product_data(html, selectors)
# → {"title": "Widget Pro", "price": "$49.99", "description": "..."}
```
The model sees a small JSON object — not 20,000 tokens of HTML.
### Managing selector libraries

Maintain a configuration file that maps domains to their selector patterns:

```json
{
  "shop.example.com": {
    "title": "h1.product-title",
    "price": ".current-price",
    "stock": ".availability-status"
  }
}
```
Add validation so your pipeline alerts when selectors return empty results — that’s usually your first sign that a site has redesigned. When a selector breaks, you update one line instead of rewriting prompt logic.
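A minimal sketch of that check — the field names are whatever your selector config defines, and the alert here is illustrative (in production, route it to a logger or monitoring channel rather than stdout):

```python
def validate_extraction(url: str, data: dict) -> list:
    """Return the names of fields whose selectors came back empty."""
    missing = [field for field, value in data.items() if not value]
    if missing:
        # Hypothetical alert: swap in your logging/alerting of choice
        print(f"Selector alert for {url}: no value for {missing}")
    return missing
```

Pairing this with a fallback — sending the filtered full page to the model only when a field comes back empty — keeps data flowing while you fix the selector.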
## Technique 3: Convert HTML to Markdown or Structured JSON First
Language models handle clean text more efficiently than HTML markup. Converting filtered HTML to markdown before passing it to the model typically cuts token count by another 30–50% — and often improves extraction accuracy at the same time.
### HTML to Markdown
Python’s html2text library handles this directly:
```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False
converter.ignore_images = True

markdown_content = converter.handle(filtered_html)
```
Instead of seeing:
```html
<div class="product-container">
  <h2 class="product-title">Widget Pro</h2>
  <span class="price">$49.99</span>
  ...
```
The model sees:
```markdown
## Widget Pro

**Price:** $49.99

Durable aluminum frame, compatible with iOS and Android.
Free shipping on orders over $50.
```
Same information, significantly fewer tokens, easier for the model to parse.
### Using Jina Reader as a one-step shortcut
Jina Reader accepts a URL and returns clean markdown in a single API call — no preprocessing code required:
```
GET https://r.jina.ai/https://example.com/products/widget-pro
```
For articles and documentation pages especially, this replaces both the fetch and filtering steps. It’s particularly useful for quick prototyping or lower-volume use cases.
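From Python, the same call needs only the standard library. The helper names below are our own, not part of any Jina SDK — the service just requires prefixing the target URL with its endpoint:

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"

def reader_url(url: str) -> str:
    # Jina Reader takes the full target URL appended to its endpoint
    return READER_PREFIX + url

def fetch_as_markdown(url: str, timeout: float = 30.0) -> str:
    # Returns the page rendered as markdown, ready to hand to a model
    with urlopen(reader_url(url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```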
### Pre-extracting to JSON
If you know your target schema, run a lightweight extraction step to generate JSON before involving the main model:
```json
{
  "title": "Widget Pro",
  "price": 49.99,
  "currency": "USD",
  "in_stock": true,
  "features": ["iOS/Android compatible", "2-year warranty", "Free shipping"]
}
```
A downstream model call processing this JSON might use 150 tokens instead of 15,000. That cost difference compounds fast across hundreds of pages per run.
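One way to produce that JSON without any model call at all is to coerce the raw selector output from Technique 2 into typed fields. A sketch — the field names, regex, and currency rule are illustrative assumptions, not a general-purpose parser:

```python
import re

def to_typed_record(raw: dict) -> dict:
    # Coerce raw selector text into a typed record for downstream calls
    price_text = raw.get("price") or ""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return {
        "title": raw.get("title"),
        "price": float(match.group()) if match else None,
        "currency": "USD" if "$" in price_text else None,  # naive assumption
        "in_stock": "in stock" in (raw.get("stock") or "").lower(),
    }

record = to_typed_record({"title": "Widget Pro", "price": "$49.99", "stock": "In Stock"})
# → {"title": "Widget Pro", "price": 49.99, "currency": "USD", "in_stock": True}
```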
## Technique 4: Batch Multiple Pages Into a Single Model Call
Every model API call carries overhead — the system prompt is paid for on every request. If your agent scrapes 50 pages and makes 50 model calls with the same extraction instructions, you’re paying for 50 full copies of your system prompt.
Batching combines multiple page contents into a single call.
### Prompt structure for batched extraction
```
You will receive content from multiple web pages.
Each page is wrapped in --- PAGE START --- and --- PAGE END --- markers,
with the URL on the first line inside each block.
For each page, extract: title, price, in_stock.
Return a JSON array with one object per page.

--- PAGE START ---
URL: https://example.com/product-a
[cleaned content of page A]
--- PAGE END ---

--- PAGE START ---
URL: https://example.com/product-b
[cleaned content of page B]
--- PAGE END ---
```
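Assembling that structure from a list of (URL, cleaned content) pairs takes only a few lines. A sketch — the function name is ours, and the instruction text is supplied by the caller:

```python
def build_batch_prompt(instructions: str, pages: list) -> str:
    # pages: list of (url, cleaned_content) tuples
    blocks = [
        f"--- PAGE START ---\nURL: {url}\n{content}\n--- PAGE END ---"
        for url, content in pages
    ]
    return instructions + "\n\n" + "\n\n".join(blocks)
```

One call carrying 20 pages pays for the instruction block once instead of 20 times.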
### Batch size guidelines
| Content Type | Recommended Batch Size |
|---|---|
| Product listings (price, title, stock) | 20–50 pages |
| Short reviews or ratings | 10–25 pages |
| Full articles or long descriptions | 5–10 pages |
Test extraction accuracy at your chosen batch size. Models can lose precision at the end of very long contexts — start smaller and scale up if quality holds. For more on structuring effective extraction prompts, MindStudio’s prompt engineering resources cover batched instruction patterns in detail.
### Where the savings come from
If your extraction instructions are 800 tokens and you make 100 calls, you’re spending 80,000 tokens on instructions alone. Batch those into 10 calls and that drops to 8,000 — a 90% reduction on system prompt overhead before you’ve optimized anything else.
## Technique 5: Use Incremental Runs to Skip Unchanged Pages
Most recurring scraping workflows re-process pages that haven’t changed since the last run. If you monitor 500 product pages daily and only 30 change each day, you’re paying to re-process 470 pages of identical content.
Incremental runs solve this by checking for changes before sending anything to the model.
### Content hashing
Store a hash of each page’s extracted content after each run. On the next run, compute the hash again and compare:
```python
import hashlib

stored_hashes = {}  # In production: persist to a database

def get_content_hash(content: str) -> str:
    return hashlib.md5(content.encode()).hexdigest()

def needs_processing(url: str, current_content: str) -> bool:
    current_hash = get_content_hash(current_content)
    if stored_hashes.get(url) == current_hash:
        return False  # No change, skip
    stored_hashes[url] = current_hash
    return True  # Changed, process it
```
### What to hash
Hash the filtered, relevant section of content — not the full raw HTML. Changes to ad banners or navigation updates shouldn’t trigger unnecessary reprocessing of unchanged product data.
For news and blog sites, check RSS feeds or sitemaps with lastmod timestamps to detect new content without fetching pages at all. Many sites update their sitemaps reliably, making change detection essentially free in terms of token costs.
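A sketch of the sitemap check using only the standard library — it assumes the standard sitemap namespace and ISO-8601 `lastmod` values, which compare correctly as plain strings when they share a format:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_urls(sitemap_xml: str, since_iso: str) -> list:
    # Return URLs whose <lastmod> is newer than the given ISO date string
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc and lastmod and lastmod > since_iso:
            urls.append(loc)
    return urls
```

Only the URLs this returns need to be fetched, filtered, and sent on to the model.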
### Expected impact
On a stable e-commerce site, 85–95% of product pages typically don’t change between daily runs. Incremental scraping means processing 25–75 pages instead of 500 — an 85–95% reduction in model calls with no loss of data coverage.
## Technique 6: Cache Extraction Results Between Runs
Caching operates at a different layer than incremental runs. Instead of skipping the scrape, caching stores the extracted results and serves them from storage when the same content is requested again within a defined time window.
### When caching has the most impact
Caching is most valuable when:
- Multiple downstream processes query the same scraped data independently
- Your agent re-runs frequently but the data doesn’t need to be real-time
- Parallel agents process overlapping URL sets
### A simple cache implementation
```python
import time

class ExtractionCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.cache = {}
        self.ttl = ttl_seconds

    def get(self, url: str):
        if url in self.cache:
            result, timestamp = self.cache[url]
            if time.time() - timestamp < self.ttl:
                return result
        return None

    def set(self, url: str, result: dict):
        self.cache[url] = (result, time.time())
```
In production, replace the in-memory dict with Redis or a database so the cache persists across agent runs. The pattern stays the same.
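As one concrete way to persist it, the same get/set interface can sit on SQLite from the standard library — a sketch under that assumption, not tied to any particular platform:

```python
import json
import sqlite3
import time

class PersistentCache:
    def __init__(self, path: str = ":memory:", ttl_seconds: int = 3600):
        # Pass a file path instead of :memory: to survive across runs
        self.conn = sqlite3.connect(path)
        self.ttl = ttl_seconds
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, result TEXT, ts REAL)"
        )

    def get(self, url: str):
        row = self.conn.execute(
            "SELECT result, ts FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return json.loads(row[0])
        return None  # Missing or expired

    def set(self, url: str, result: dict):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (url, result, ts) VALUES (?, ?, ?)",
            (url, json.dumps(result), time.time()),
        )
        self.conn.commit()
```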
## Combined impact at a glance
When you layer all six techniques together, the cumulative token savings are substantial:
| Technique | Typical Token Reduction |
|---|---|
| HTML filtering | 50–70% |
| Hardcoded selectors | 90%+ (replaces HTML with raw field values) |
| Markdown/JSON conversion | 30–50% of remaining HTML |
| Batching | 80–90% reduction in system prompt overhead |
| Incremental runs | 85–95% fewer pages processed |
| Caching | Eliminates redundant calls within TTL window |
Applied together on a recurring, structured scraping job, total token costs typically drop by 85–95% compared to naive implementations that send raw HTML to the model on every run.
## Building Optimized Scraping Workflows in MindStudio
If you’re building AI agents with web scraping capabilities, MindStudio’s visual workflow builder lets you apply these techniques without writing a full backend.
A typical optimized scraping workflow in MindStudio looks like:
- **Fetch URL** — a built-in HTTP request step
- **Filter and clean** — a JavaScript or Python function step that strips noise and extracts target sections
- **Extract with AI** — pass cleaned content to your model of choice (Claude, GPT-4o, Gemini, and 200+ others available without separate API keys)
- **Store results** — send extracted data to Airtable, Google Sheets, Notion, a webhook, or any of 1,000+ connected tools
- **Schedule it** — run the workflow on a timer for recurring jobs
The platform handles rate limiting, retries, and error management at the infrastructure layer, so your workflow logic stays focused on extraction. Most automated data pipelines built in MindStudio go from idea to deployed in under a day.
For developers building more complex scraping agents with LangChain, CrewAI, or custom frameworks, MindStudio’s Agent Skills Plugin exposes workflow execution as a typed method call:
```typescript
import MindStudio from '@mindstudio-ai/agent';

const { agent } = new MindStudio();

const result = await agent.runWorkflow({
  workflowId: 'optimized-scraper',
  variables: { url: targetUrl, schema: extractionSchema }
});
```
The plugin handles authentication, retries, and error surfacing — your agent calls a method and gets structured results back. You can try it free at mindstudio.ai.
## Frequently Asked Questions
### How many tokens does a typical web page use?
A raw, unfiltered web page typically uses 10,000–40,000 tokens when passed to a language model. After HTML filtering (removing scripts, styles, and navigation), that usually drops to 2,000–8,000 tokens. After converting to markdown or using CSS selectors to pull only the target content, you can often get below 500 tokens for structured pages like product listings.
### What’s the single highest-impact technique for reducing web scraping token costs?

HTML filtering has the largest single-step impact. Removing `<script>`, `<style>`, `<nav>`, `<header>`, and `<footer>` elements before the model sees anything typically cuts token count by 50–70% with minimal implementation effort. If you can only apply one technique, start there.
### Can I apply these techniques to JavaScript-rendered pages?
Yes. For dynamically rendered pages, you first need a headless browser like Playwright or Puppeteer to execute the JavaScript and capture the final DOM. Once you have rendered HTML, all the same filtering, selector, and conversion techniques apply. Reserve headless browsers only for pages that require them — they’re slower and more resource-intensive than fetching static HTML directly.
### How do I know when hardcoded selectors have stopped working?
Add validation to your extraction pipeline: if a selector returns empty results or a value that fails a basic sanity check (a price field with no digits, a title field longer than 500 characters), log an alert and fall back to sending the filtered full-page content to the model. Monitor your extraction success rates and set up threshold-based alerts so broken selectors don’t go unnoticed for days.
### Does batching multiple pages hurt extraction accuracy?
Sometimes. Models can lose precision on content near the end of very long batches, especially for complex or nuanced content. For simple structured extractions like price, title, and stock status, batches of 20–50 pages typically maintain high accuracy. For longer content like reviews or articles, limit batches to 5–10 pages. Always validate accuracy at your chosen batch size before relying on it in production.
### How does incremental scraping work with paginated sites?
For paginated sites, check the first page for new items before fetching subsequent pages. If everything on page 1 is already in your dataset, skip the rest. If the site has an RSS feed or a sitemap with lastmod timestamps, use those to identify new or changed content without fetching any pages directly. This is the most token-efficient approach — you detect changes without triggering any model calls at all.
## Key Takeaways
- **Raw HTML is expensive by default** — most of what’s on a page is markup and noise. A single unfiltered page can cost as much as 40,000 tokens.
- **Filter first** — stripping scripts, styles, and navigation before the model sees anything is the fastest, highest-impact single optimization.
- **Hardcode selectors for recurring scrapes** — if the structure is predictable, extraction doesn’t need to involve the model at all.
- **Convert to clean text** — markdown or JSON representations of the same content typically use 30–60% fewer tokens than equivalent HTML.
- **Batch requests and cache results** — amortizing system prompt costs and skipping unchanged pages are where most of the compounding savings come from in recurring workflows.
- **Layer these techniques** — each one helps independently; combined, they can cut token usage by 85–95% on structured, recurring scraping jobs.
If you want to build scraping workflows that apply these optimizations from the start, MindStudio’s no-code workflow builder is worth a look — you can have a working pipeline running in an afternoon.