How to Build a Web Scraping Agent That Finds Leads and Contact Info
Use Claude Code and Playwright to search the web, visit sites, and extract contact information automatically. A practical guide with real examples.
What This Agent Actually Does
Building a web scraping agent that finds leads and extracts contact information is one of the most practical automation projects you can take on. The core idea is simple: give Claude Code a search query, let it use Playwright to visit sites, and have it pull out names, emails, phone numbers, and other contact details — all without you manually browsing through dozens of pages.
This guide walks through how to build that agent from scratch. You’ll end up with something that can take a list of target companies or keywords, search Google or specialized directories, visit each result, extract structured contact data, and save it all to a file or spreadsheet.
It’s not magic. But it does work, and it can replace hours of manual prospecting per week.
What You’ll Need Before Starting
Before writing a single line of code, make sure you have the following in place:
- Node.js 18+ installed locally
- Claude Code installed and authenticated (requires an Anthropic API key)
- Playwright installed (npm install playwright)
- A clear idea of what you’re scraping — company websites, business directories, LinkedIn alternatives, or niche listing sites
- Basic familiarity with how Playwright selectors work
If you’re new to browser automation with Claude, the guide on using browser automation with Claude Code for web scraping and form filling covers the foundational concepts before you get into lead-specific workflows.
You’ll also want to think through scope early. Are you scraping 50 companies or 5,000? The architecture looks different at each scale. Start small, validate the output quality, then expand.
Step 1: Set Up Your Project Structure
Start with a clean directory. Here’s a simple structure that keeps things organized:
lead-scraper/
├── index.js # Main entry point
├── scraper.js # Playwright browser logic
├── extractor.js # Contact parsing logic
├── storage.js # CSV/JSON output
├── targets.txt # Input list of domains or queries
└── results/ # Output folder
Install your dependencies:
npm init -y
npm install playwright @anthropic-ai/sdk csv-writer dotenv
npx playwright install chromium
Create a .env file with your API key:
ANTHROPIC_API_KEY=your_key_here
Now set up the main Claude Code session. This is where you instruct the agent on what it needs to do. Claude Code will orchestrate the browser — launching pages, deciding what to click, reading content, and passing structured results back to your script.
Step 2: Define the Search Strategy
Your agent needs a starting point. There are two common approaches:
Approach A: Start from a list of known domains. You already have company names or URLs, and the agent visits each one and extracts contact info.
Approach B: Start from a search query. The agent searches Google, Bing, or a directory like Clutch or Crunchbase, collects URLs from the results, then visits each one.
Approach B is more powerful but more complex. Let’s build it.
Here’s the search scraping logic in scraper.js:
const { chromium } = require('playwright');
async function searchForLeads(query, maxResults = 20) {
const browser = await chromium.launch({ headless: true });
  // Set a realistic user agent so requests look like a normal browser.
  // Playwright takes the user agent as a page/context option, not an extra HTTP header.
  const page = await browser.newPage({
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  });
const searchUrl = `https://www.bing.com/search?q=${encodeURIComponent(query)}`;
await page.goto(searchUrl, { waitUntil: 'networkidle' });
// Extract result URLs
const links = await page.$$eval('h2 a', els =>
els.map(el => el.href).filter(href => href.startsWith('http'))
);
await browser.close();
return links.slice(0, maxResults);
}
module.exports = { searchForLeads };
This gets you a list of URLs. The next step is visiting each one and pulling out contact data.
Step 3: Build the Contact Extraction Layer
Contact information usually lives in a few predictable places: the homepage, a /contact page, an /about page, or a site footer. Your agent needs to check all of these.
Here’s the extraction function in extractor.js:
const { chromium } = require('playwright');
const EMAIL_REGEX = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;
const PHONE_REGEX = /(\+?\d{1,3}[\s\-.]?)?\(?\d{3}\)?[\s\-.]?\d{3}[\s\-.]?\d{4}/g;
async function extractContactInfo(baseUrl) {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
const results = {
url: baseUrl,
emails: new Set(),
phones: new Set(),
companyName: '',
contactPage: ''
};
const pagesToCheck = [
baseUrl,
`${baseUrl}/contact`,
`${baseUrl}/contact-us`,
`${baseUrl}/about`,
`${baseUrl}/about-us`
];
for (const url of pagesToCheck) {
try {
await page.goto(url, { waitUntil: 'networkidle', timeout: 10000 });
const bodyText = await page.innerText('body');
const emails = bodyText.match(EMAIL_REGEX) || [];
const phones = bodyText.match(PHONE_REGEX) || [];
emails.forEach(e => results.emails.add(e.toLowerCase()));
phones.forEach(p => results.phones.add(p.trim()));
// Try to get company name from title or OG tags
if (!results.companyName) {
results.companyName = await page.title();
}
} catch (e) {
// Page not found or timeout — skip
}
}
await browser.close();
return {
...results,
emails: [...results.emails],
phones: [...results.phones]
};
}
module.exports = { extractContactInfo };
This visits up to five pages per domain and collects every email and phone number it finds using regex patterns.
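One easy win before moving on: plain regex misses obfuscated addresses like name [at] company [dot] com. The most common patterns can be normalized before matching. Here’s a small helper you could drop into extractor.js; the patterns it covers are assumptions, not an exhaustive list:
const OBFUSCATION_PATTERNS = [
  [/\s*\[\s*at\s*\]\s*|\s*\(\s*at\s*\)\s*/gi, '@'],
  [/\s*\[\s*dot\s*\]\s*|\s*\(\s*dot\s*\)\s*/gi, '.']
];

// Normalize "name [at] company [dot] com" style text before running EMAIL_REGEX
function deobfuscate(text) {
  return OBFUSCATION_PATTERNS.reduce(
    (acc, [pattern, replacement]) => acc.replace(pattern, replacement),
    text
  );
}

// In the page loop, match against the normalized text as well:
const extraEmails = deobfuscate(bodyText).match(EMAIL_REGEX) || [];
extraEmails.forEach(e => results.emails.add(e.toLowerCase()));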
The regex approach works well for direct extraction. For more structured data — like a “Meet the Team” page with named contacts — you’ll want to add Claude’s language model to parse and classify what it finds. That’s covered in Step 5.
Step 4: Save Results to CSV
Once you have extracted data, write it to a CSV that’s easy to open in Google Sheets, HubSpot, or whatever CRM you use.
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const fs = require('fs');
const csvWriter = createCsvWriter({
path: 'results/leads.csv',
header: [
{ id: 'url', title: 'Website' },
{ id: 'companyName', title: 'Company Name' },
{ id: 'emails', title: 'Emails' },
{ id: 'phones', title: 'Phone Numbers' }
]
});
async function saveResults(leads) {
  // Make sure the output folder exists; csv-writer won't create it for you
  fs.mkdirSync('results', { recursive: true });
const rows = leads.map(lead => ({
url: lead.url,
companyName: lead.companyName,
emails: lead.emails.join(', '),
phones: lead.phones.join(', ')
}));
await csvWriter.writeRecords(rows);
console.log(`Saved ${rows.length} leads to results/leads.csv`);
}
module.exports = { saveResults };
For more advanced storage — like writing directly to a database or syncing with a CRM — check out the lead enrichment tool workflow tutorial, which shows how to structure data pipelines for outbound sales workflows.
Step 5: Add Claude to Classify and Enrich the Data
Raw regex extraction misses a lot. Pages with obfuscated emails (like info [at] company [dot] com), dynamic content loaded by JavaScript, or structured team pages require a smarter approach.
This is where Claude earns its place. After scraping the page text, you can pass it to Claude with a simple prompt:
require('dotenv').config(); // Load ANTHROPIC_API_KEY from .env
const Anthropic = require('@anthropic-ai/sdk');
const client = new Anthropic(); // Reads ANTHROPIC_API_KEY from the environment
async function enrichWithClaude(pageText, url) {
const message = await client.messages.create({
model: 'claude-opus-4-5',
max_tokens: 1024,
messages: [
{
role: 'user',
content: `Extract contact information from this web page content. Return a JSON object with:
- companyName (string)
- emails (array of strings)
- phones (array of strings)
- contactPerson (string, if a specific contact name is mentioned)
- title (string, their job title if mentioned)
Only include information that is explicitly present. Do not guess.
Page URL: ${url}
Page content:
${pageText.slice(0, 4000)}`
}
]
});
try {
const jsonMatch = message.content[0].text.match(/\{[\s\S]*\}/);
return jsonMatch ? JSON.parse(jsonMatch[0]) : {};
} catch {
return {};
}
}
Slicing the page content to 4,000 characters keeps token usage reasonable. If you’re scraping hundreds of sites, token costs add up fast. The guide on optimizing web scraping skills for AI agents has six specific techniques for keeping those costs under control.
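A blind slice can also cut off the one line that matters. An alternative is to keep only the lines that are likely to contain contact details before sending the text to Claude. A rough sketch; the keyword list is an assumption you’d tune for your niche:
function keepContactLines(pageText, maxChars = 4000) {
  // Lines that hint at contact details: emails, phone-like digits, contact wording
  const CONTACT_HINTS = /@|\bcontact\b|\bphone\b|\btel\b|\bemail\b|\bcall us\b|\d{3}[\s\-.]\d{3}/i;
  const filtered = pageText
    .split('\n')
    .filter(line => CONTACT_HINTS.test(line))
    .join('\n');
  // Fall back to the plain slice if filtering stripped out too much context
  return filtered.length > 200 ? filtered.slice(0, maxChars) : pageText.slice(0, maxChars);
}
You’d then pass keepContactLines(pageText) into the prompt in place of the raw slice.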
Step 6: Wire It All Together
Here’s the index.js that runs the full pipeline:
const { searchForLeads } = require('./scraper');
const { extractContactInfo } = require('./extractor');
const { enrichWithClaude } = require('./claude');
const { saveResults } = require('./storage');
const { chromium } = require('playwright');
const fs = require('fs');
async function run() {
// Read search queries from file
const queries = fs.readFileSync('targets.txt', 'utf-8')
.split('\n')
.filter(Boolean);
const allLeads = [];
for (const query of queries) {
console.log(`Searching: ${query}`);
const urls = await searchForLeads(query, 10);
for (const url of urls) {
console.log(` Scraping: ${url}`);
try {
const contact = await extractContactInfo(url);
// Only enrich with Claude if we need more structure
if (contact.emails.length === 0) {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto(url, { timeout: 10000 });
const text = await page.innerText('body');
await browser.close();
const enriched = await enrichWithClaude(text, url);
Object.assign(contact, enriched);
}
allLeads.push(contact);
} catch (e) {
console.error(` Failed: ${url} — ${e.message}`);
}
}
}
await saveResults(allLeads);
}
run().catch(console.error);
Add your search queries to targets.txt, one per line:
digital marketing agencies Chicago
SaaS companies hiring SDRs
B2B software companies series A
Run it with node index.js and the agent will work through each query, visit each result, extract what it can, and save everything to CSV.
Handling Anti-Scraping Measures
Most business websites don’t aggressively block scrapers, but some do. Here are the common issues and how to handle them:
Rate limiting and blocks
Add delays between requests to avoid triggering rate limits:
await page.waitForTimeout(2000 + Math.random() * 3000);
Randomizing the delay makes your agent look more like a human browsing at varying speeds.
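Because the extractor in Step 3 hits up to five pages on the same domain back to back, it also helps to enforce a minimum gap per domain rather than relying on one global pause. A small sketch, with a 3-second gap as an arbitrary starting point:
const lastRequestAt = new Map();

// Wait until at least minGapMs has passed since the last request to this domain
async function politePause(domain, minGapMs = 3000) {
  const elapsed = Date.now() - (lastRequestAt.get(domain) || 0);
  if (elapsed < minGapMs) {
    await new Promise(resolve => setTimeout(resolve, minGapMs - elapsed));
  }
  lastRequestAt.set(domain, Date.now());
}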
JavaScript-rendered content
Some contact pages render their content dynamically. The waitUntil: 'networkidle' option in Playwright handles most cases. For particularly stubborn pages, try:
await page.waitForSelector('main', { timeout: 5000 });
CAPTCHAs and login walls
If a site requires login or throws a CAPTCHA, your agent can’t get through programmatically. The practical answer: skip those sites and move on. Log them separately for manual review.
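A cheap way to do both is a heuristic check on the page text plus a separate log file. This is only a sketch; the phrases it looks for are assumptions and will catch some false positives:
const fs = require('fs');

// Rough signals that a page is a CAPTCHA, bot challenge, or login wall
function looksBlocked(bodyText) {
  return /captcha|verify you are human|unusual traffic|log in to continue|access denied/i.test(bodyText);
}

// Inside the extractor's page loop, right after reading bodyText:
if (looksBlocked(bodyText)) {
  fs.appendFileSync('skipped.txt', `${url}\n`); // review these manually later
  continue;
}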
For social platforms like LinkedIn, the dynamics are different. Read about bypassing browser automation blocks on LinkedIn and Instagram if that’s part of your target channel.
robots.txt compliance
Always check robots.txt before scraping a site at scale. Most business directories explicitly allow crawling their public-facing pages. Scraping behind authentication or at abusive rates is a separate matter — don’t do it.
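You can automate a first-pass check with the built-in fetch in Node 18+. This sketch only catches a blanket Disallow: / under the wildcard user agent; a package like robots-parser on npm handles the full spec, so treat it as a rough filter rather than full compliance:
// Returns true if robots.txt appears to disallow all crawling for every user agent
async function robotsDisallowsAll(baseUrl) {
  try {
    const res = await fetch(new URL('/robots.txt', baseUrl));
    if (!res.ok) return false; // no robots.txt found
    const body = await res.text();
    // Very rough: a wildcard group containing "Disallow: /" on its own line
    return /user-agent:\s*\*[\s\S]*?^disallow:\s*\/\s*$/im.test(body);
  } catch {
    return false;
  }
}
Call it before extractContactInfo and skip the domain when it returns true.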
Scaling Up: Parallel Execution
Once your basic agent works, the natural next step is speed. Running requests sequentially means each site adds latency. If you’re processing 500 URLs, that could take hours.
The fix is parallel execution — running multiple browser instances at the same time. Here’s a simple batching approach:
const pLimit = require('p-limit');
const limit = pLimit(5); // Max 5 concurrent browsers
const results = await Promise.all(
urls.map(url => limit(() => extractContactInfo(url)))
);
Install it with npm install p-limit@3 (version 4 and later are ESM-only, so they can't be loaded with require). Capping concurrency at 5 gives you roughly 5x throughput without hammering any single server.
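One caveat: extractContactInfo as written launches its own Chromium process, so five concurrent calls means five full browsers. A lighter pattern is to share one browser and give each task its own isolated context. The sketch below assumes extractContactInfo has been imported from extractor.js and refactored to accept a context argument instead of launching a browser itself:
const { chromium } = require('playwright');
const pLimit = require('p-limit');

async function scrapeAll(urls) {
  const browser = await chromium.launch({ headless: true });
  const limit = pLimit(5); // still capped at 5 concurrent tasks

  const results = await Promise.all(
    urls.map(url => limit(async () => {
      // Contexts are isolated (cookies, storage) but share one browser process
      const context = await browser.newContext();
      try {
        return await extractContactInfo(url, context); // hypothetical refactored signature
      } finally {
        await context.close();
      }
    }))
  );

  await browser.close();
  return results;
}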
For more sophisticated parallel setups — including managing multiple Claude instances simultaneously — the parallel browser agents guide covers the architecture in detail.
Common Mistakes and How to Avoid Them
Scraping the wrong pages
Not every result from a search query is a lead. Filter out irrelevant domains early:
const SKIP_DOMAINS = ['wikipedia.org', 'linkedin.com', 'youtube.com', 'reddit.com'];
const filtered = urls.filter(url =>
!SKIP_DOMAINS.some(domain => url.includes(domain))
);
Treating all emails equally
Addresses like info@ and support@ are rarely useful for outreach. Filter them out:
const GENERIC_PREFIXES = ['info', 'support', 'hello', 'contact', 'admin', 'noreply'];
const valuableEmails = emails.filter(email => {
const prefix = email.split('@')[0].toLowerCase();
return !GENERIC_PREFIXES.includes(prefix);
});
Not deduplicating
If you run multiple queries, you’ll often hit the same company site twice. Keep a visited URL set:
const visited = new Set();
if (!visited.has(url)) {
visited.add(url);
// proceed with scraping
}
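Exact URL matching still lets http/https variants and different pages on the same site slip through. Normalizing to a bare hostname first catches those; a small sketch:
// "https://www.example.com/about?x=1" → "example.com"
function normalizeToDomain(url) {
  try {
    return new URL(url).hostname.replace(/^www\./, '');
  } catch {
    return url; // leave malformed URLs untouched
  }
}

const visitedDomains = new Set();
const domain = normalizeToDomain(url);
if (!visitedDomains.has(domain)) {
  visitedDomains.add(domain);
  // proceed with scraping
}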
Collecting data you can’t use
More data isn’t always better. A list of 2,000 companies with no emails is less useful than 200 companies with verified contact names and addresses. Adjust your stop conditions accordingly — the web scraping skill guide on token reduction and stop conditions explains how to build agents that know when to stop rather than collecting indefinitely.
Turning Raw Leads into Outreach
Extracting contact data is step one. The output is only useful if you do something with it. A few common next steps:
- Import into a CRM — Most CRMs accept CSV imports. Map your columns to the right fields.
- Run verification — Use an email verification service (like NeverBounce or ZeroBounce) to check that addresses are deliverable before sending; a quick MX-record check (see the sketch after this list) can pre-filter domains that can't receive mail at all.
- Personalize at scale — Pass your contact list to a personalization workflow that researches each company and drafts a tailored first line. If you’re building that layer, the guide on personalized outreach workflows is worth reading.
- Feed into an AI sales agent — Sales teams using AI agents increasingly run lead data through automated qualification before any human touches it.
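For that free first pass before paying for a verification service, Node's built-in dns module can at least confirm a domain has mail servers. It doesn't prove a specific address is deliverable, only that the domain can receive mail, so treat it as a coarse filter:
const { resolveMx } = require('dns').promises;

// True if the email's domain publishes at least one MX record
async function domainAcceptsMail(email) {
  const domain = email.split('@')[1];
  try {
    const records = await resolveMx(domain);
    return records.length > 0;
  } catch {
    return false; // domain doesn't exist or has no mail servers
  }
}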
If your goal is a full outbound pipeline — research, scraping, enrichment, qualification, and outreach — it’s worth thinking about the architecture as a multi-skill agent rather than a single script. The AI company research agent with Claude Code and ClickUp is a good example of how those pieces fit together.
Where Remy Fits In
If you find yourself wanting to productize this — turn the scraping agent into something your team can trigger via a dashboard, log into, share results through, and manage without touching the terminal — that’s where Remy becomes relevant.
Remy compiles full-stack applications from a spec: backend, database, auth, frontend, deployment. Instead of maintaining a pile of scripts and a shared spreadsheet, you describe your lead scraping app in a spec document, and Remy builds the actual product around it.
That might look like: a web interface where team members can submit search queries, a results page that shows extracted contacts with status filters, and an export button that pulls clean CSVs. All of that is a full-stack application. You can describe it in annotated prose and have Remy generate it, rather than wiring together five separate tools.
The spec stays in sync with the code, so as your requirements evolve — add email verification, add CRM sync, add a qualification score — you update the spec and recompile. You don’t maintain a chaotic tangle of scripts.
Try it at mindstudio.ai/remy.
Frequently Asked Questions
Is it legal to scrape websites for contact information?
It depends on the jurisdiction, the site, and what you do with the data. Scraping publicly available business contact information — emails listed on a company’s website, phone numbers on a contact page — is generally considered legal in most countries, including the US, as long as you’re not bypassing authentication, violating robots.txt in a harmful way, or collecting personal data regulated under GDPR without a lawful basis. Always check the terms of service for specific sites and consult a lawyer if you’re operating at significant scale or targeting European users.
How accurate is regex-based email extraction?
Reasonably accurate for standard email formats, but it misses obfuscated addresses (e.g., name [at] company [dot] com) and emails embedded in images. Combining regex with Claude’s language understanding closes most of those gaps. Expect around 70–80% coverage on a typical business website, with the remaining 20–30% requiring manual follow-up or a different source.
What’s the best source for B2B leads besides company websites?
Business directories like Clutch, G2, Crunchbase, and AngelList often have more structured data than individual company sites. Industry association member directories are also underused. For very specific niches, niche job boards, conference speaker lists, and award program pages are worth targeting — they often include names and company affiliations that you can then look up directly.
How do I avoid getting my IP blocked?
Use reasonable delays between requests (2–5 seconds), rotate user agents, and don’t hammer the same domain repeatedly. For large-scale scraping, consider using a rotating proxy service. Most small-scale lead generation (hundreds of sites per day) won’t trigger blocks on typical business websites.
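Rotating user agents can be as simple as picking from a small pool each time you open a page. A sketch; the strings below are placeholders you'd replace with current, real browser user agents:
const USER_AGENTS = [
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];

const page = await browser.newPage({
  userAgent: USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)]
});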
Can this agent work with LinkedIn?
LinkedIn actively blocks automated scraping and enforces this aggressively. Their robots.txt disallows most crawler activity, and their terms of service prohibit scraping. There are compliant ways to use LinkedIn for prospecting — the Claude Code LinkedIn outreach guide covers what’s possible within those boundaries.
How do I handle sites that require JavaScript to render their content?
Playwright handles JavaScript-heavy sites natively because it runs a real browser engine. Just make sure you’re using waitUntil: 'networkidle' or waiting for specific elements to appear before extracting text. This covers most SPAs and dynamic contact pages. The guide on automating browser tasks with Claude Code and Playwright CLI covers advanced waiting strategies.
Key Takeaways
- A web scraping agent that finds leads combines Playwright for browser automation, regex for contact extraction, and Claude for intelligent parsing and enrichment.
- Structure your pipeline in layers: search → collect URLs → visit pages → extract data → save results.
- Handle edge cases early: skip irrelevant domains, deduplicate, filter generic email addresses.
- Use parallel execution (p-limit) to scale from hundreds to thousands of sites without proportionally increasing runtime.
- Claude’s language understanding fills the gaps that regex misses — obfuscated emails, structured team pages, and ambiguous content.
- For teams that need a real interface around this workflow rather than a collection of scripts, Remy lets you build the full product from a spec.