import asyncio
import json
import os
import sys
sys.path.insert(0, "../agents")
from dotenv import load_dotenv
load_dotenv("../.env")
# -- Backend selection --------------------------------------------------------
# Set to "ollama" to use your local/remote Ollama instance instead of OpenRouter
# Can also set LLM_BACKEND=ollama in .env
os.environ.setdefault("LLM_BACKEND", "openrouter") # change to "ollama" to switch
from llm import get_backend, get_model
# Import the tool functions directly
# (same functions the MCP server exposes - no server process needed for testing)
from mcp_server import (
    fetch_page_structure,
    generate_scraper_config,
    test_selector,
    run_scrape,
    load_config,
    compare_rendering,
)
SSR_PRODUCTS = "http://localhost:8001/products"
CSR_PRODUCTS = "http://localhost:8002/products"
SSR_JOBS = "http://localhost:8003/jobs"
print(f"Backend : {get_backend()}")
print(f"Model : {get_model()}")
print("Tools imported")
Chapter 16: The Scraper as an MCP Server
The Model Context Protocol (MCP) is an open standard that lets AI assistants call external tools as first-class operations. Instead of copy-pasting URLs and HTML into a chat window, an MCP-enabled agent can call fetch_page_structure(), test_selector(), and run_scrape() directly, the same way it calls a calculator or a file system tool.
This notebook:
1. Explains the 6 tools the scraper MCP server exposes
2. Tests each tool against the demo sites
3. Shows the full autonomous workflow: URL → config → scrape
4. Shows how to connect the server to Claude Code
16.1 Tool Inventory
| Tool | What it does |
|---|---|
| fetch_page_structure | Fetches a URL, returns compact HTML tag tree + CSR detection |
| test_selector | Tests a CSS selector against a live page, returns count + previews |
| compare_rendering | Fetches SSR and CSR versions side-by-side to show the difference |
| load_config | Loads a pre-built demo config by name |
| run_scrape | Executes a full scrape with a config JSON, returns records |
| generate_scraper_config | LLM-generates a config from a URL (uses OpenRouter/Ollama) |
All tools return JSON strings: a consistent format that any LLM can parse and reason about.
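Because every tool returns a JSON string, a caller can decode all results through one helper. The following is a minimal sketch, not the server's own code: the `parse_tool_result` helper, the `ToolError` class, and the convention that failures arrive under an `"error"` key are all assumptions made for illustration, since the notebook does not show the server's error format.

```python
import json


class ToolError(Exception):
    """Raised when a tool reports a failure (hypothetical convention)."""


def parse_tool_result(raw: str) -> dict:
    """Decode a tool's JSON string and surface failures as exceptions."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolError(f"tool returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ToolError(f"expected a JSON object, got {type(data).__name__}")
    # Assumption: an "error" key signals failure; adjust to the real schema.
    if "error" in data:
        raise ToolError(str(data["error"]))
    return data


ok = parse_tool_result('{"status_code": 200, "is_csr": false}')
print(ok["status_code"], ok["is_csr"])  # prints: 200 False
```

Centralising the decode step keeps the per-tool cells short and gives an agent loop a single place to catch malformed output.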
16.2 fetch_page_structure
The first thing an AI agent does when given an unknown URL: understand its structure. This tool returns a condensed tag tree: enough context for the LLM to infer selectors without overflowing its context window with 300KB of raw HTML.
result = await fetch_page_structure(SSR_PRODUCTS)
r = json.loads(result)
print(f"Status: {r['status_code']}")
print(f"Is CSR: {r['is_csr']}")
print(f"Visible text: {r['visible_text_length']} chars")
if r['csr_note']:
    print(f"Note: {r['csr_note']}")
print()
print("Structure (first 500 chars):")
print(r['structure'][:500])
[04/11/26 13:58:20] INFO HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK" _client.py:1740
Status: 200
Is CSR: False
Visible text: 2222 chars
Structure (first 500 chars):
<body> 'ShopSphereAll ProductsSSRProducts(50 items)CategoriesAll Pro'
<header class="site-header"> 'ShopSphereAll ProductsSSR'
<div class="container header-inner"> 'ShopSphereAll ProductsSSR'
<a class="logo" href="/"> 'ShopSphere'
<nav class="site-nav"> 'All Products'
<a href="/"> 'All Products'
<div class="header-badge"> 'SSR'
<main class="container"> 'Products(50 items)CategoriesAll ProductsAccessoriesAudioBusi'
<div class="page-header"> 'Products(50 items
# CSR site - agent sees an empty shell and knows to use Playwright
result = await fetch_page_structure(CSR_PRODUCTS)
r = json.loads(result)
print(f"Status: {r['status_code']}")
print(f"Is CSR: {r['is_csr']}")
print(f"Visible text: {r['visible_text_length']} chars")
print(f"Note: {r['csr_note']}")
[04/11/26 13:58:31] INFO HTTP Request: GET http://localhost:8002/products "HTTP/1.1 200 OK" _client.py:1740
Status: 200
Is CSR: True
Visible text: 159 chars
Note: Page appears client-side rendered - use render_mode=playwright
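The condensed structure shown in this section can be approximated with the standard library alone. The sketch below is an assumption about how such a summary might be built, not the actual implementation of fetch_page_structure: the `TagTree` class and `summarize` function are hypothetical names, and unlike the real output it reports only the class attribute.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class TagTree(HTMLParser):
    """Records one node per element: tag, class, depth, and descendant text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.nodes = []      # every element, in document order
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        node = {"tag": tag, "cls": dict(attrs).get("class"),
                "depth": len(self.stack), "text": []}
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            for node in self.stack:  # text counts toward every open ancestor
                node["text"].append(text)


def summarize(html: str, preview_chars: int = 60) -> str:
    """Return an indented one-line-per-element summary of the markup."""
    parser = TagTree()
    parser.feed(html)
    parser.close()
    lines = []
    for n in parser.nodes:
        cls = f' class="{n["cls"]}"' if n["cls"] else ""
        preview = "".join(n["text"])[:preview_chars]
        lines.append(f'{"  " * n["depth"]}<{n["tag"]}{cls}> {preview!r}')
    return "\n".join(lines)


sample = ('<body><header class="site-header"><a class="logo" href="/">'
          'ShopSphere</a></header><script>var x = 1;</script></body>')
print(summarize(sample))
```

Each line costs a few dozen characters regardless of how large the element's subtree is, which is what keeps the summary small enough to hand to an LLM.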
16.3 test_selector
Before committing a selector to a config, verify it matches what you expect. An agent uses this to validate selectors it infers from the page structure.
result = await test_selector(SSR_PRODUCTS, "article.product-card")
r = json.loads(result)
print(f"Selector: {r['selector']}")
print(f"Matches: {r['count']}")
print(f"Note: {r['note']}")
print()
print("Previews:")
for preview in r['previews']:
    print(f" text: {preview['text'][:60]!r}")
[04/11/26 13:58:40] INFO HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK" _client.py:1740
Selector: article.product-card
Matches: 10
Note: 10 elements matched
Previews:
text: 'LaptopsMacBook Pro 14"$1999.004.8/5(2341 reviews)In StockThe'
text: 'LaptopsDell XPS 15$1499.004.6/5(1876 reviews)In StockPremium'
text: 'LaptopsLenovo ThinkPad X1 Carbon Gen 11$1299.004.7/5(1543 re'
text: 'LaptopsASUS ZenBook Pro Duo 15$1799.004.4/5(876 reviews)Out '
text: 'SmartphonesiPhone 15 Pro$999.004.9/5(5432 reviews)In StockTi'
# Wrong selector - agent gets immediate feedback to try something else
result = await test_selector(SSR_PRODUCTS, "div.product")
r = json.loads(result)
print(f"Selector '{r['selector']}': {r['note']}")
[04/11/26 13:58:46] INFO HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK" _client.py:1740
Selector 'div.product': No matches found
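The try, check, try-again pattern above can be sketched as a loop over candidate selectors. This is an illustration under stated assumptions, not how test_selector works internally: `count_matches` and `first_matching` are hypothetical helpers, and the matcher understands only simple `tag.class` selectors, with no combinators, IDs, or attribute tests.

```python
from html.parser import HTMLParser


class SimpleSelectorCounter(HTMLParser):
    """Counts elements matching a 'tag.class' selector (no combinators)."""

    def __init__(self, selector: str):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        tag_ok = not self.tag or tag == self.tag
        cls_ok = not self.cls or self.cls in classes
        if tag_ok and cls_ok:
            self.count += 1


def count_matches(html: str, selector: str) -> int:
    counter = SimpleSelectorCounter(selector)
    counter.feed(html)
    return counter.count


def first_matching(html: str, candidates: list) -> tuple:
    """Return (selector, count) for the first candidate that matches anything."""
    for selector in candidates:
        n = count_matches(html, selector)
        if n:
            return selector, n
    return None, 0


page = ('<article class="product-card featured">A</article>'
        '<article class="product-card">B</article>')
print(first_matching(page, ["div.product", "article.product-card"]))
# prints: ('article.product-card', 2)
```

Note that `div.product` does not match `product-card`: class matching is done on whole whitespace-separated tokens, as in CSS.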
16.4 compare_rendering
Shows an AI agent (or a human) exactly why CSR sites need Playwright: with a static fetch, the same product listing yields 10 product cards from the SSR site and 0 from the CSR site.
result = await compare_rendering(SSR_PRODUCTS, CSR_PRODUCTS)
r = json.loads(result)
print("SSR:")
print(f" Visible text: {r['ssr']['visible_text_chars']} chars")
print(f" Cards found: {r['ssr']['cards_found']}")
print(f" Scrapeable: {r['ssr']['scrapeable_without_js']}")
print()
print("CSR:")
print(f" Visible text: {r['csr']['visible_text_chars']} chars")
print(f" Cards found: {r['csr']['cards_found']}")
print(f" Scrapeable: {r['csr']['scrapeable_without_js']}")
print(f" Note: {r['csr']['note']}")
print()
print("Verdict:", r['verdict'])
[04/11/26 13:58:51] INFO HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8002/products "HTTP/1.1 200 OK" _client.py:1740
SSR:
Visible text: 2222 chars
Cards found: 10
Scrapeable: True
CSR:
Visible text: 159 chars
Cards found: 0
Scrapeable: False
Note: CSR requires Playwright. Static fetch returns only the empty HTML shell.
Verdict: SSR: direct scraping works. CSR: JavaScript must execute first - use render_mode=playwright in config.
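The gap between 2222 and 159 visible characters suggests a simple heuristic: count the text a static fetch exposes and flag pages that fall below a threshold. The sketch below is one plausible approach, not the tool's actual detection logic; the `visible_text_length` and `looks_client_rendered` helpers and the 200-character cutoff are assumptions chosen to separate the two demo sites.

```python
from html.parser import HTMLParser

HIDDEN_TAGS = ("script", "style", "noscript")


class VisibleText(HTMLParser):
    """Collects text that is not inside script, style, or noscript elements."""

    def __init__(self):
        super().__init__()
        self.hidden = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden:
            self.hidden -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.hidden:
            self.chunks.append(text)


def visible_text_length(html: str) -> int:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def looks_client_rendered(html: str, min_chars: int = 200) -> bool:
    # Assumption: 200 is an illustrative cutoff, not the server's value.
    return visible_text_length(html) < min_chars


shell = ('<html><head><title>App</title><style>body{}</style></head>'
         '<body><div id="root"></div><script>render()</script></body></html>')
full = "<body>" + "<p>Product description text</p>" * 20 + "</body>"
print(looks_client_rendered(shell), looks_client_rendered(full))
# prints: True False
```

A length threshold is crude: a short but fully rendered page would be misclassified, so a production check might also look for an empty mount point such as a bare root div.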
16.5 load_config + run_scrape
For known demo sites, pre-built configs are available. An agent can load one and run a scrape without generating a new config from scratch.
# Load a pre-built config
result = load_config("shopsphere-ssr")
config = json.loads(result)
print(f"Name: {config['name']}")
print(f"Render mode: {config['render_mode']}")
print(f"Fields: {list(config['fields'].keys())}")
# List all available configs
result = load_config("nonexistent")
r = json.loads(result)
print(f"Available: {r['available']}")
Name: ShopSphere SSR
Render mode: static
Fields: ['title', 'price', 'rating', 'review_count', 'category', 'in_stock', 'description', 'product_id', 'tags', 'image', 'gallery_images']
Available: ['shopsphere-ssr', 'jobhive-csr', 'shopsphere-csr', 'jobhive-ssr']
# Run a scrape with the loaded config
result = await run_scrape(json.dumps(config), max_items=3)
r = json.loads(result)
print(f"Records scraped: {r['count']}")
print(f"Render mode: {r['render_mode']}")
print()
for rec in r['records']:
    print(f" {rec['title']} - ${rec['price']} - {rec['category']} - rating: {rec['rating']}")
[04/11/26 13:59:24] INFO HTTP Request: GET http://localhost:8001/products?page=1 "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8001/products/macbook-pro-14 "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8001/products/dell-xps-15 "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8001/products/lenovo-thinkpad-x1-carbon "HTTP/1.1 200 OK" _client.py:1740
Records scraped: 3
Render mode: static
MacBook Pro 14" - $1999.00 - Laptops - rating: 4.8
Dell XPS 15 - $1499.00 - Laptops - rating: 4.6
Lenovo ThinkPad X1 Carbon Gen 11 - $1299.00 - Laptops - rating: 4.7
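Before handing a config to run_scrape, it can help to check its shape. The sketch below uses only keys that appear in this notebook (name, render_mode, fields, listing.link_selector, sources[].url_template); everything else is an assumption, including the `validate_config` helper, which keys are mandatory, the selector values, and the product URL template, which is inferred from the request log rather than taken from the real config file.

```python
RENDER_MODES = {"static", "playwright"}
REQUIRED_KEYS = ("name", "render_mode", "fields", "sources")  # assumed mandatory


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means it looks sane."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    mode = config.get("render_mode")
    if mode is not None and mode not in RENDER_MODES:
        problems.append(f"unknown render_mode: {mode!r}")
    if "fields" in config and not config["fields"]:
        problems.append("fields is empty")
    for i, source in enumerate(config.get("sources", [])):
        # Assumption: paginated sources carry a {n} page placeholder.
        if "{n}" not in source.get("url_template", ""):
            problems.append(f"sources[{i}].url_template has no {{n}} placeholder")
    return problems


example = {
    "name": "ShopSphere SSR",
    "render_mode": "static",
    "sources": [{"url_template": "http://localhost:8001/products?page={n}"}],
    "listing": {"link_selector": "article.product-card a"},  # hypothetical selector
    "fields": {"title": "h1", "price": ".price"},            # hypothetical selectors
}
print(validate_config(example))  # prints: []
```

Returning a list of problems rather than raising on the first one lets an agent fix several issues in a single regeneration pass.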
16.6 generate_scraper_config
Given only a URL, the LLM fetches the page structure and generates a complete config. This is the tool an agent uses when encountering a new, unknown site.
# Generate a config for JobHive SSR from scratch
result = await generate_scraper_config(SSR_JOBS)
config = json.loads(result)
print(f"render_mode: {config.get('render_mode')}")
print(f"fields: {list(config.get('fields', {}).keys())}")
print()
print("link_selector:", config.get('listing', {}).get('link_selector'))
print("url_template: ", config.get('sources', [{}])[0].get('url_template'))
Fetching listing page http://localhost:8003/jobs
[04/11/26 13:59:31] INFO HTTP Request: GET http://localhost:8003/jobs "HTTP/1.1 200 OK" _client.py:1025
Fetching sample detail page http://localhost:8003/jobs/senior-backend-engineer-stripe
INFO HTTP Request: GET http://localhost:8003/jobs/senior-backend-engineer-stripe "HTTP/1.1 200 OK" _client.py:1025
Detail page summary (45 lines)
Listing page summary (47 lines)
Asking LLM to generate config...
[04/11/26 13:59:33] INFO HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK" _client.py:1025
render_mode: static
fields: ['title', 'company', 'location', 'salary', 'job_type', 'tags', 'department', 'experience', 'job_id']
link_selector: article.job-card a.job-link
url_template: http://localhost:8003/jobs?page={n}
# Run a scrape with the generated config
result = await run_scrape(json.dumps(config), max_items=3)
r = json.loads(result)
print(f"Records scraped: {r['count']}")
print()
for rec in r['records']:
    title = rec.get('title', 'N/A')
    company = rec.get('company', 'N/A')
    salary = rec.get('salary', 'N/A')
    print(f" {title} @ {company} - {salary}")
[04/11/26 13:59:56] INFO HTTP Request: GET http://localhost:8003/jobs?page=1 "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8003/jobs/senior-backend-engineer-stripe "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8003/jobs/staff-software-engineer-airbnb "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8003/jobs/senior-frontend-engineer-figma "HTTP/1.1 200 OK" _client.py:1740
Records scraped: 3
Senior Backend Engineer @ Stripe - $180,000 - $240,000
Staff Software Engineer @ Airbnb - $200,000 - $280,000
Senior Frontend Engineer @ Figma - $160,000 - $220,000
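An LLM asked for a config does not always reply with bare JSON; replies are often wrapped in a code fence or a sentence of prose. The sketch below shows one defensive way to recover the object by taking the outermost braces. It is an assumption about a reasonable post-processing step, not a description of what generate_scraper_config does, and `extract_json` is a hypothetical helper.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the outermost {...} span of an LLM reply as a JSON object.

    Tolerates code fences or prose around the object. Raises ValueError
    if no object is found or the span is not valid JSON.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM reply")
    return json.loads(reply[start:end + 1])


fence = "`" * 3  # built at runtime so the example contains no literal fence
reply = f'{fence}json\n{{"render_mode": "static", "fields": {{"title": "h1"}}}}\n{fence}'
config = extract_json(reply)
print(config["render_mode"], list(config["fields"]))  # prints: static ['title']
```

Because json.JSONDecodeError subclasses ValueError, a caller can catch a single exception type and ask the model to retry.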
16.7 The Full Autonomous Workflow
Putting it together: an AI agent receives a URL and returns structured data without any human intervention.
async def autonomous_scrape(url: str, max_items: int = 5) -> list:
    """
    Full MCP tool chain:
    1. fetch_page_structure → understand the page
    2. generate_scraper_config → LLM writes the config
    3. run_scrape → execute and return records
    """
    print(f"Step 1: analysing {url}")
    structure = json.loads(await fetch_page_structure(url))
    print(f" is_csr={structure['is_csr']}, text_len={structure['visible_text_length']}")
    print("Step 2: generating config via LLM...")
    config = json.loads(await generate_scraper_config(url))
    print(f" render_mode={config.get('render_mode')}, fields={list(config.get('fields',{}).keys())}")
    print(f"Step 3: scraping {max_items} items...")
    output = json.loads(await run_scrape(json.dumps(config), max_items=max_items))
    print(f" scraped {output['count']} records")
    return output['records']
records = await autonomous_scrape(SSR_PRODUCTS, max_items=3)
print()
print("Results:")
for rec in records:
    print(f" {rec.get('title')} - {rec.get('price')} - {rec.get('category')}")
Step 1: analysing http://localhost:8001/products
[04/11/26 14:01:02] INFO HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK" _client.py:1740
is_csr=False, text_len=2222
Step 2: generating config via LLM...
Fetching listing page http://localhost:8001/products
INFO HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK" _client.py:1025
Fetching sample detail page http://localhost:8001/products/macbook-pro-14
INFO HTTP Request: GET http://localhost:8001/products/macbook-pro-14 "HTTP/1.1 200 OK" _client.py:1025
Detail page summary (45 lines)
Listing page summary (41 lines)
Asking LLM to generate config...
[04/11/26 14:01:03] INFO HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK" _client.py:1025
[04/11/26 14:01:06] INFO HTTP Request: GET http://localhost:8001/products?page=1 "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8001/products/macbook-pro-14 "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8001/products/dell-xps-15 "HTTP/1.1 200 OK" _client.py:1740
INFO HTTP Request: GET http://localhost:8001/products/lenovo-thinkpad-x1-carbon "HTTP/1.1 200 OK" _client.py:1740
render_mode=static, fields=['title', 'price', 'rating', 'category', 'in_stock', 'image', 'description', 'tags', 'specs']
Step 3: scraping 3 items...
scraped 3 records
Results:
MacBook Pro 14" - $1999.00 - Laptops
Dell XPS 15 - $1499.00 - Laptops
Lenovo ThinkPad X1 Carbon Gen 11 - $1299.00 - Laptops
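The workflow above assumes the generated config works on the first attempt. A minimal sketch of one recovery step follows: if a scrape returns no records and the config was not already using Playwright, retry once with render_mode set to playwright. The `scrape_with_fallback` wrapper and the `fake_run_scrape` stand-in are hypothetical; the stand-in only mimics the tool's JSON shape (count, render_mode, records) so the example runs without the demo sites.

```python
import asyncio
import json


async def scrape_with_fallback(config: dict, run_scrape, max_items: int = 5) -> dict:
    """Run a scrape; if it yields nothing, retry once with Playwright rendering."""
    output = json.loads(await run_scrape(json.dumps(config), max_items=max_items))
    if output.get("count", 0) == 0 and config.get("render_mode") != "playwright":
        retry = {**config, "render_mode": "playwright"}
        output = json.loads(await run_scrape(json.dumps(retry), max_items=max_items))
    return output


# Stand-in for the real tool (hypothetical behaviour): static fetches find nothing.
async def fake_run_scrape(config_json: str, max_items: int = 5) -> str:
    mode = json.loads(config_json)["render_mode"]
    records = [{"title": "Demo"}] if mode == "playwright" else []
    return json.dumps({"count": len(records), "render_mode": mode, "records": records})


result = asyncio.run(scrape_with_fallback({"render_mode": "static"}, fake_run_scrape))
print(result["render_mode"], result["count"])  # prints: playwright 1
```

Inside a notebook, where an event loop is already running, call `await scrape_with_fallback(config, run_scrape)` directly instead of using asyncio.run. Passing the tool in as a parameter keeps the wrapper testable without a live server.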
16.8 Connecting to Claude Code
To use these tools from Claude Code, register the MCP server in your Claude Code settings. Claude Code will start the server automatically and make the tools available in every conversation.
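Add the following entry to your MCP configuration file (for Claude Code, typically a project-level .mcp.json; the exact location depends on your version and configuration scope):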
{
  "mcpServers": {
    "scraper": {
      "command": "/path/to/scrapping/.venv/bin/python",
      "args": ["/path/to/scrapping/agents/mcp_server.py"],
      "env": {
        "LLM_BACKEND": "openrouter",
        "OPENROUTER_API_KEY": "sk-or-v1-..."
      }
    }
  }
}
Or add via CLI:
claude mcp add scraper \
  /path/to/scrapping/.venv/bin/python \
  /path/to/scrapping/agents/mcp_server.py
Once connected, you can ask Claude:
“Scrape the first 10 products from http://localhost:8001/products and give me a table of names, prices, and ratings.”
Claude will call fetch_page_structure → generate_scraper_config → run_scrape autonomously and return the structured data.
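If you prefer not to paste an API key into a config file by hand, the server entry can be generated from the environment. This is a convenience sketch rather than part of the project: `build_server_entry` is a hypothetical helper, and it assumes the directory layout shown in the JSON above (a .venv at the project root and agents/mcp_server.py beneath it).

```python
import json
import os
from pathlib import PurePosixPath


def build_server_entry(project_root: str, backend: str = "openrouter") -> dict:
    """Build the mcpServers entry for the scraper, reading the key from the environment."""
    root = PurePosixPath(project_root)
    env = {"LLM_BACKEND": backend}
    api_key = os.environ.get("OPENROUTER_API_KEY")
    if backend == "openrouter" and api_key:
        env["OPENROUTER_API_KEY"] = api_key  # only included when it is actually set
    return {
        "mcpServers": {
            "scraper": {
                "command": str(root / ".venv" / "bin" / "python"),
                "args": [str(root / "agents" / "mcp_server.py")],
                "env": env,
            }
        }
    }


print(json.dumps(build_server_entry("/path/to/scrapping", backend="ollama"), indent=2))
```

The printed JSON has the same shape as the hand-written entry above, so it can be redirected into the configuration file.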
16.9 Key Takeaways
- MCP turns scraping tools into AI-callable operations: no copy-pasting HTML, no manual config writing
- Each tool has a single, well-defined job: structure analysis, selector testing, config generation, execution
- Tools return JSON strings: a universal format any LLM can parse and act on
- The autonomous workflow is 3 tool calls: fetch_page_structure → generate_scraper_config → run_scrape
- The same tool functions work in notebooks: no MCP server process needed for testing