
How to Convert Files to Markdown to Reduce AI Token Usage by Up to 90%

HTML, PDF, and DOCX files waste tokens on formatting noise. Converting them to Markdown before they reach the model can cut token usage by 65–90% with no loss in content quality.

MindStudio Team

Why Your Documents Are Wasting Tokens Before the AI Even Reads Them

If you’re feeding raw HTML, PDF, or DOCX files into an AI model, you’re probably paying for a lot of content the model doesn’t need. Formatting tags, metadata, redundant markup, navigation elements, repeated headers — none of it helps the model understand your document. All of it costs tokens.

Converting files to Markdown before passing them to an AI is one of the most straightforward ways to cut token usage by 65–90% with no loss in content quality. The model gets the same information. You pay significantly less for it.

This guide covers why the problem exists, how to quantify it, and exactly how to convert HTML, PDF, and DOCX files to clean Markdown using free tools.


The Token Problem with Raw Document Formats

To understand why format conversion matters, it helps to understand how token-based pricing works. You pay per input token. Every character, tag, attribute, and whitespace chunk counts toward that total.

Raw document formats are extraordinarily wasteful by this measure.

HTML: The Worst Offender

A typical web page might contain 800 words of readable content. But that same page, as raw HTML, routinely contains 10,000–25,000 characters of markup: <div> soup, inline styles, data- attributes, JavaScript snippets, class names, aria labels, link hrefs, meta tags, and more.

When you feed raw HTML to an AI model, maybe 10–15% of the tokens represent actual content. The rest is structural noise the model has to work through to find what matters.

A rough example: a news article with 500 words of body text might clock in at around 700 tokens as clean Markdown. As raw HTML scraped from the page, the same article can exceed 8,000 tokens — mostly boilerplate.
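You can get a feel for this ratio with nothing but the standard library. The sketch below strips the tags from a small HTML fragment and compares sizes using a crude ~4-characters-per-token heuristic; the fragment itself is invented for illustration, not real page data:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the visible text, discarding every tag and attribute."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def estimate_tokens(s: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(s) // 4)

html = (
    '<div class="article-body" data-track="impression">'
    '<p style="margin:0 0 1em 0"><span class="dropcap">T</span>he actual '
    'story text is short.</p>'
    '<nav aria-label="breadcrumb"><a href="/home">Home</a></nav>'
    '</div>'
)

extractor = TextExtractor()
extractor.feed(html)
plain = re.sub(r"\s+", " ", "".join(extractor.parts)).strip()

print("raw HTML tokens:", estimate_tokens(html))
print("plain text tokens:", estimate_tokens(plain))
```

Even on this tiny fragment, the markup dwarfs the content; on a full page with scripts, nav, and footers, the gap widens further.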

PDF: Hidden Overhead

PDFs seem cleaner, but naive extraction creates its own problems. PDF parsers often produce repeated headers and footers on every page, incorrect reading order, ligature artifacts, broken hyphenation, embedded font references, and metadata blobs. What should be 1,000 clean tokens becomes 1,800 messy ones — and that’s a well-behaved PDF.

Scanned PDFs or those with complex multi-column layouts are worse. The model spends attention on noise rather than substance.

DOCX: XML Underneath

Word documents are ZIP archives full of XML. The document.xml file inside a .docx contains every paragraph wrapped in a thicket of <w:p>, <w:r>, <w:t>, and <w:rPr> tags carrying font sizes, color values, bold flags, and spacing instructions. Extracting content correctly requires stripping all of that. Doing it naively — just passing the raw XML — is extremely expensive and often incoherent.
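You can verify this overhead with the standard library alone. This sketch opens a .docx as the ZIP archive it is and compares the size of document.xml to the visible text inside its &lt;w:t&gt; elements. The filename document.docx is a placeholder, and the regex is a rough approximation, not a real WordprocessingML parser:

```python
import re
import zipfile

def docx_xml_vs_text(path):
    """Return (raw XML length, visible text length) for a .docx file."""
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    # Visible text lives inside <w:t> elements; everything else is markup.
    text = "".join(re.findall(r"<w:t[^>]*>([^<]*)</w:t>", xml))
    return len(xml), len(text)

# xml_len, text_len = docx_xml_vs_text("document.docx")
# print(f"markup overhead: {xml_len / max(text_len, 1):.1f}x")
```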

Even libraries that “convert” DOCX to text often carry over blank lines, redundant spacing, and orphaned style markers that inflate token counts unnecessarily.


How Much Can You Actually Save?

The savings vary by format and document complexity, but here are realistic ranges:

Source Format                | Typical Token Reduction After Markdown Conversion
Raw HTML (web pages)         | 75–90%
DOCX (Word documents)        | 50–70%
PDF (text-based)             | 40–65%
PDF (complex layout/scanned) | 20–50%

These aren’t theoretical. They’re the kind of numbers you see when you run a benchmark on real documents. A 10-page product spec in DOCX that naively extracts to 12,000 tokens often compresses to under 4,000 as clean Markdown.

At scale — if you’re running an AI agent that ingests dozens of documents per session — this compounds fast. If your AI session is draining faster than expected, document format is one of the first places to check.
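To see why this compounds, it helps to put numbers on it. The price, documents-per-session, and session counts below are hypothetical and invented for illustration; only the 12,000 → 4,000 token figures come from the example above:

```python
# All rates and volumes here are hypothetical; check your provider's pricing.
PRICE_PER_MILLION = 3.00          # dollars per million input tokens (assumed)
TOKENS_RAW, TOKENS_MD = 12_000, 4_000
DOCS_PER_SESSION = 50             # assumed
SESSIONS_PER_DAY = 100            # assumed

saved_tokens = (TOKENS_RAW - TOKENS_MD) * DOCS_PER_SESSION * SESSIONS_PER_DAY
saved_dollars = saved_tokens / 1_000_000 * PRICE_PER_MILLION

print(f"tokens saved per day: {saved_tokens:,}")
print(f"cost saved per day:   ${saved_dollars:.2f}")
```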


How to Convert HTML to Markdown

Option 1: html2text (Python)

html2text is a Python library that converts HTML to Markdown-formatted plain text. It’s fast, well-maintained, and handles most standard HTML structures cleanly.

Install:

pip install html2text

Basic usage:

import html2text
import requests

h = html2text.HTML2Text()
h.ignore_links = False       # Keep hyperlinks
h.ignore_images = True       # Skip image tags
h.ignore_tables = False      # Preserve table structure
h.body_width = 0             # Don't wrap lines

html_content = requests.get("https://example.com/article").text
markdown = h.handle(html_content)

print(markdown)

The body_width = 0 setting is important — without it, the library wraps long lines at 78 characters, which can break the structure of code blocks and table cells.

When to use: Scraping web pages, converting CMS exports, processing HTML email templates.

Option 2: Trafilatura (for Web Content Extraction)

If you’re pulling content from web pages, trafilatura is purpose-built for extracting the main content and ignoring nav menus, sidebars, ads, and boilerplate. It outputs clean Markdown.

pip install trafilatura
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
text = trafilatura.extract(downloaded, output_format="markdown", include_tables=True)

Trafilatura is particularly good at discarding elements that have no informational value. For AI ingestion purposes, this is often exactly what you want. It pairs well with techniques like the Scout pattern for pre-screening context before loading it into your agent.

Option 3: Pandoc (Universal Converter)

Pandoc is a command-line tool that converts between dozens of formats, including HTML to Markdown.

pandoc input.html -f html -t markdown -o output.md

Pandoc produces clean CommonMark output and handles complex documents well. It’s the right tool when you’re batch-processing files and want consistent output.
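For a folder of pages, the same command drops into a short loop. A minimal sketch, assuming pandoc is on your PATH and the files sit in one flat directory:

```shell
# Batch-convert every .html file in the current directory to Markdown.
for f in *.html; do
  [ -e "$f" ] || continue   # glob matched nothing: skip
  pandoc "$f" -f html -t markdown -o "${f%.html}.md"
done
```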


How to Convert PDF to Markdown

PDFs are harder than HTML because there’s no logical document structure — just positioned text boxes. The quality of the conversion depends heavily on the tool.

Option 1: pymupdf4llm

This is a Python library built on PyMuPDF, specifically designed to extract LLM-friendly Markdown from PDFs. It handles multi-column layouts, tables, and header detection better than most generic extractors.

pip install pymupdf4llm
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("document.pdf")
print(md_text)

For straightforward text-based PDFs, this is the fastest path to clean Markdown. For AI-driven PDF summarization, running pymupdf4llm before your summarization step meaningfully reduces costs.

Option 2: marker

Marker is an open-source Python package that uses ML models to extract structured Markdown from PDFs. It handles complex layouts, tables, code blocks, and equations better than rules-based approaches.

pip install marker-pdf
from marker.convert import convert_single_pdf
from marker.models import load_all_models

models = load_all_models()
full_text, images, metadata = convert_single_pdf("document.pdf", models)

Marker is slower than pymupdf4llm because it’s running inference to understand layout, but the output quality is significantly better for complex documents — research papers, financial reports, technical manuals.

Option 3: Pandoc (for Text-Extractable PDFs)

Pandoc can convert PDFs if they’re text-extractable (not scanned):

pandoc input.pdf -o output.md

Results vary. For clean single-column PDFs, it works well. For anything with complex structure, pymupdf4llm or marker will do better.

Option 4: LlamaParse

LlamaParse from LlamaIndex is a document parser built for AI agent pipelines. It handles PDF, DOCX, PPTX, and more, outputting clean text or Markdown. It’s API-based and designed for production workflows where you’re processing many documents at volume.


How to Convert DOCX to Markdown

Option 1: Pandoc

Pandoc handles DOCX to Markdown conversion extremely well. Word’s heading styles become Markdown headings, bold and italic are preserved, tables convert cleanly.

pandoc input.docx -f docx -t markdown -o output.md

If you want media pulled out into a separate directory (the Markdown then carries only lightweight file references, which are easy to strip before feeding an LLM that doesn’t process inline images):

pandoc input.docx -f docx -t markdown --extract-media=./images -o output.md

This is the simplest and most reliable approach for most DOCX files.

Option 2: mammoth (Python)

Mammoth is a Python library that converts DOCX to clean HTML or Markdown. It’s useful when you need programmatic control over the conversion, such as mapping custom Word styles to specific Markdown elements.

pip install mammoth
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
    markdown = result.value

Mammoth works well for standard corporate documents but may struggle with heavily styled files that use custom templates.


Cleaning Up Your Markdown Output

Raw conversion output often needs a cleanup pass before it’s truly optimal for AI ingestion. Common issues:

  • Excessive blank lines — Many converters insert two or three blank lines between paragraphs. One is enough.
  • Orphaned headers — Navigation elements sometimes get converted as Markdown headings. Remove them.
  • Redundant HTML entities — &nbsp;, &amp;, and &lt; should be decoded to plain characters.
  • Tables with empty cells — Often a sign of layout-tables being converted. Remove or flatten.

A simple Python cleanup function:

import re

def clean_markdown(md: str) -> str:
    # Collapse 3+ blank lines into 2
    md = re.sub(r'\n{3,}', '\n\n', md)
    # Remove HTML entities (decode &amp; last so '&amp;lt;' doesn't double-decode)
    md = md.replace('&nbsp;', ' ').replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')
    # Strip leading/trailing whitespace
    return md.strip()

This kind of cleanup pass can shave another 5–15% off token counts on top of the conversion savings.
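A hand-maintained replace chain misses less common entities like &hellip; or numeric forms like &#8217;. The standard library’s html.unescape decodes the full set; here is a variant of the same cleanup built on it:

```python
import html
import re

def clean_markdown(md: str) -> str:
    # Decode every named and numeric HTML entity in one pass.
    md = html.unescape(md)
    # &nbsp; decodes to a non-breaking space (U+00A0); normalize it.
    md = md.replace("\u00a0", " ")
    # Collapse 3+ consecutive newlines into a single blank line.
    md = re.sub(r"\n{3,}", "\n\n", md)
    # Drop trailing spaces at line ends, then outer whitespace.
    md = re.sub(r"[ \t]+\n", "\n", md)
    return md.strip()
```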


Batch Processing: Converting Files at Scale

If you’re building an AI agent that regularly processes documents, you want this pipeline automated. Here’s a minimal batch converter in Python that handles all three formats:

import os
import subprocess
import pymupdf4llm
import html2text

def convert_to_markdown(filepath: str) -> str:
    ext = os.path.splitext(filepath)[1].lower()
    
    if ext == ".pdf":
        return pymupdf4llm.to_markdown(filepath)
    
    elif ext == ".docx":
        result = subprocess.run(
            ["pandoc", filepath, "-f", "docx", "-t", "markdown"],
            capture_output=True, text=True
        )
        return result.stdout
    
    elif ext in [".html", ".htm"]:
        with open(filepath, "r", encoding="utf-8") as f:
            html = f.read()
        h = html2text.HTML2Text()
        h.ignore_images = True
        h.body_width = 0
        return h.handle(html)
    
    else:
        raise ValueError(f"Unsupported format: {ext}")

For production pipelines, you’d add error handling, logging, and caching — but this is the core conversion logic. Run each document through this before inserting it into your AI context, and the token savings compound across every call.
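On top of that, a small driver can walk a folder tree and pair every supported source file with a .md target. The directory name and the wiring are assumptions; the commented-out loop reuses the convert_to_markdown function above:

```python
import os

SUPPORTED = {".pdf", ".docx", ".html", ".htm"}

def markdown_targets(root):
    """Yield (source_path, target_path) pairs for each supported file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, ext = os.path.splitext(name)
            if ext.lower() in SUPPORTED:
                yield os.path.join(dirpath, name), os.path.join(dirpath, stem + ".md")

# Wiring it up (convert_to_markdown is the function defined above):
# for src, dst in markdown_targets("./docs"):
#     with open(dst, "w", encoding="utf-8") as out:
#         out.write(convert_to_markdown(src))
```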


Where This Fits in Your AI Workflow

Format conversion isn’t just a cost-saving trick. It’s a form of context hygiene that affects output quality too.

When a model reads clean Markdown, it spends its attention on content. When it reads raw HTML or bloated PDF extraction, it’s processing noise alongside signal. The context window is finite — every token of formatting noise is a token not available for actual reasoning.

This connects to a broader principle: the cleaner and more precise your inputs, the better your outputs. If you’re building effective prompts for AI agents, document quality matters as much as prompt quality. The model only knows what you give it.

Markdown-based knowledge stores also have a longer-term structural advantage. If you’re deciding between a Markdown knowledge base and a vector database for your retrieval setup, the comparison between LLM wikis and RAG is worth reading — format plays a direct role in which approach makes sense.

For agent workflows that ingest many documents across a session, combining format conversion with progressive disclosure techniques can dramatically extend how much useful work you can do before hitting context limits.


How Remy Handles Document Context

Remy is built on the same principle: the source of truth should be as clean and precise as possible. When you write a spec in Remy, it’s annotated Markdown — lightweight, structured, readable by both humans and agents. Nothing extraneous.

That same philosophy applies to document inputs. When you’re building an app in Remy that needs to ingest documents — say, a report analyzer, a policy search tool, or a contract reviewer — the conversion step belongs in the pipeline, not as an afterthought.

Remy compiles your spec into a full-stack app with a real backend, database, and auth. If your app processes documents, the pre-processing logic is part of the backend method you define in the spec. Clean input handling is a first-class concern, not a bolt-on.

You can try Remy at mindstudio.ai/remy and see how spec-driven development handles document-heavy workflows with less friction than wiring everything together manually.


Frequently Asked Questions

Does converting to Markdown affect AI output quality?

No — in most cases it improves it. The model receives cleaner input without spending attention on formatting noise. The content is identical; the structure is more interpretable. For tasks like summarization, extraction, and classification, Markdown-formatted input consistently performs as well or better than raw HTML or extracted PDF text.

Which file format wastes the most tokens?

Raw HTML is typically the worst offender. A web page with 500 words of content can easily contain 5,000–20,000 tokens of markup, scripts, and boilerplate. Converting to Markdown before passing to an AI model often yields 80–90% token reduction for HTML. DOCX and PDF are generally less extreme but still significant — typically 50–70% reduction.

Can I convert files to Markdown without writing code?

Yes. Pandoc is a free command-line tool that converts HTML, DOCX, and other formats to Markdown without any programming. For web pages, browser extensions like MarkDownload can save pages directly as Markdown. For PDFs, tools like Mathpix (for technical documents) and online converters handle conversion through a UI.

Does this work with all AI models?

Yes. Markdown is plain text with minimal syntax characters. Every major language model — GPT-4, Claude, Gemini, Llama, Mistral — handles Markdown input well. The token savings are real regardless of which model you’re using, because all of them use similar tokenization schemes where markup tags add substantial token overhead.

What about documents with important formatting like tables or code?

Markdown preserves tables using pipe syntax (| col1 | col2 |) and code blocks using backtick fences. Most converters handle these correctly. Tables from DOCX and structured PDF tables convert reliably with Pandoc and pymupdf4llm. The key is to verify that critical structure survives conversion before deploying automated pipelines at scale.

Is there a downside to converting everything to Markdown?

A few edge cases: documents that rely heavily on visual layout (spatial positioning, colors, complex nested tables) may lose some fidelity in conversion. Scanned PDFs require OCR before Markdown conversion is even possible. And for documents where the original formatting has legal significance — signed contracts, regulatory filings — you’d typically want to preserve the source file alongside any Markdown version.


Key Takeaways

  • Converting HTML, PDF, and DOCX files to Markdown before AI ingestion can reduce token usage by 65–90%, with no content loss.
  • HTML is the most wasteful format — raw markup can be 10x the size of actual content by token count.
  • The best tools: trafilatura or html2text for HTML, pymupdf4llm or marker for PDF, pandoc or mammoth for DOCX.
  • A cleanup pass after conversion (collapsing whitespace, removing HTML entities) adds another 5–15% reduction.
  • Clean input improves output quality, not just cost — models reason better when they’re not processing noise.
  • Automate conversion early in your pipeline. It’s a small investment that pays off on every API call.

If you’re building document-processing workflows and want the infrastructure handled — backend, database, auth, deployment — try Remy and define your pipeline in a spec rather than wiring it together from scratch.

Presented by MindStudio
