Chapter 16: The Scraper as an MCP Server

The Model Context Protocol (MCP) is an open standard that lets AI assistants call external tools as first-class operations. Instead of copy-pasting URLs and HTML into a chat window, an MCP-enabled agent can call fetch_page_structure(), test_selector(), and run_scrape() directly, the same way it calls a calculator or a file system tool.
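
On the wire, an MCP client sends each tool invocation as a JSON-RPC 2.0 request whose method is tools/call. The sketch below builds such a message for fetch_page_structure. The argument name "url" and the request id are assumptions for illustration; the server's actual parameter names are not shown in this notebook.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# "url" as the argument name is an assumption, not taken from the server code.
request = build_tool_call(
    "fetch_page_structure", {"url": "http://localhost:8001/products"}
)
print(json.dumps(request, indent=2))
```

The agent never writes this message by hand; the MCP client library produces it whenever the model decides to use a tool.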

This notebook:

1. Explains the 6 tools the scraper MCP server exposes
2. Tests each tool against the demo sites
3. Shows the full autonomous workflow: URL → config → scrape
4. Shows how to connect the server to Claude Code

import asyncio
import json
import os
import sys
sys.path.insert(0, "../agents")

from dotenv import load_dotenv
load_dotenv("../.env")

# -- Backend selection --------------------------------------------------------
# Set to "ollama" to use your local/remote Ollama instance instead of OpenRouter
# Can also set LLM_BACKEND=ollama in .env
os.environ.setdefault("LLM_BACKEND", "openrouter")   # change to "ollama" to switch

from llm import get_backend, get_model

# Import the tool functions directly
# (same functions the MCP server exposes - no server process needed for testing)
from mcp_server import (
    fetch_page_structure,
    generate_scraper_config,
    test_selector,
    run_scrape,
    load_config,
    compare_rendering,
)

SSR_PRODUCTS = "http://localhost:8001/products"
CSR_PRODUCTS = "http://localhost:8002/products"
SSR_JOBS     = "http://localhost:8003/jobs"

print(f"Backend : {get_backend()}")
print(f"Model   : {get_model()}")
print("Tools imported")

16.1 Tool Inventory

Tool	What it does
`fetch_page_structure`	Fetches a URL, returns compact HTML tag tree + CSR detection
`test_selector`	Tests a CSS selector against a live page, returns count + previews
`compare_rendering`	Fetches SSR and CSR versions side-by-side to show the difference
`load_config`	Loads a pre-built demo config by name
`run_scrape`	Executes a full scrape with a config JSON, returns records
`generate_scraper_config`	LLM-generates a config from a URL (uses OpenRouter/Ollama)

All tools return JSON strings: a consistent format that any LLM can parse and reason about.
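
Because every tool returns a JSON string, a caller can funnel all results through one small decoding helper. The sketch below is a hypothetical helper, not part of mcp_server. It assumes that failures are reported under an "error" key, which this notebook does not confirm.

```python
import json


class ToolError(RuntimeError):
    """Raised when a tool result cannot be used (hypothetical helper type)."""


def parse_tool_result(raw: str) -> dict:
    """Decode a tool's JSON string and surface failures as exceptions."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolError(f"tool returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ToolError(f"expected a JSON object, got {type(data).__name__}")
    # Assumption: failures are reported under an "error" key.
    if "error" in data:
        raise ToolError(str(data["error"]))
    return data


ok = parse_tool_result('{"status_code": 200, "is_csr": false}')
print(ok["status_code"], ok["is_csr"])
```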

16.2 fetch_page_structure

The first thing an AI agent does with an unknown URL is work out its structure. This tool returns a condensed tag tree, which gives the LLM enough context to infer selectors without filling its context window with 300 KB of raw HTML.

result = await fetch_page_structure(SSR_PRODUCTS)
r = json.loads(result)

print(f"Status:       {r['status_code']}")
print(f"Is CSR:       {r['is_csr']}")
print(f"Visible text: {r['visible_text_length']} chars")
if r['csr_note']:
    print(f"Note:         {r['csr_note']}")
print()
print("Structure (first 500 chars):")
print(r['structure'][:500])

[04/11/26 13:58:20] INFO     HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK"     _client.py:1740

Status:       200
Is CSR:       False
Visible text: 2222 chars

Structure (first 500 chars):
<body> 'ShopSphereAll ProductsSSRProducts(50 items)CategoriesAll Pro'
  <header class="site-header"> 'ShopSphereAll ProductsSSR'
    <div class="container header-inner"> 'ShopSphereAll ProductsSSR'
      <a class="logo" href="/"> 'ShopSphere'
      <nav class="site-nav"> 'All Products'
        <a href="/"> 'All Products'
      <div class="header-badge"> 'SSR'
  <main class="container"> 'Products(50 items)CategoriesAll ProductsAccessoriesAudioBusi'
    <div class="page-header"> 'Products(50 items
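
A tree like the one above can be approximated with the standard library alone. The following is a minimal sketch, not the server's implementation: it keeps only the class, id, and href attributes, stops at an assumed max_depth, omits the text previews, and expects reasonably well-formed HTML (unclosed tags would skew the indentation).

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "head"}  # subtrees that add no visible structure
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags without a closing tag


class TagTree(HTMLParser):
    """Collect one indented line per element, up to max_depth levels deep."""

    def __init__(self, max_depth: int = 4):
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in SKIP:
            if tag not in VOID:
                self.skip_depth += 1
            return
        if self.depth < self.max_depth:
            keep = [f'{k}="{v}"' for k, v in attrs if k in ("class", "id", "href") and v]
            self.lines.append("  " * self.depth + "<" + " ".join([tag] + keep) + ">")
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.depth:
            self.depth -= 1


html = """<html><head><title>x</title></head><body>
<header class="site-header"><a class="logo" href="/">ShopSphere</a></header>
<main class="container"><article class="product-card"><h2>Laptop</h2></article></main>
<script>var a = "<div>";</script></body></html>"""

parser = TagTree(max_depth=4)
parser.feed(html)
print("\n".join(parser.lines))
```

Capping the depth is what keeps the output small: deeply nested markup inside each card is dropped, while the outer layout that selectors usually hang on is kept.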

# CSR site - agent sees an empty shell and knows to use Playwright
result = await fetch_page_structure(CSR_PRODUCTS)
r = json.loads(result)

print(f"Status:       {r['status_code']}")
print(f"Is CSR:       {r['is_csr']}")
print(f"Visible text: {r['visible_text_length']} chars")
print(f"Note:         {r['csr_note']}")

[04/11/26 13:58:31] INFO     HTTP Request: GET http://localhost:8002/products "HTTP/1.1 200 OK"     _client.py:1740

Status:       200
Is CSR:       True
Visible text: 159 chars
Note:         Page appears client-side rendered - use render_mode=playwright
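
One plausible way to arrive at an is_csr flag is to measure how much visible text a static fetch returns. The sketch below assumes a threshold of 200 characters, which is consistent with the 159 and 2222 character counts above but is not taken from the server code. The names VisibleText and looks_client_rendered are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Accumulate text outside <script>, <style>, and <noscript> elements."""

    HIDDEN = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def looks_client_rendered(html: str, min_chars: int = 200) -> bool:
    """Guess CSR when the static HTML carries very little visible text."""
    parser = VisibleText()
    parser.feed(html)
    return len(" ".join(parser.chunks)) < min_chars  # min_chars is an assumed threshold


shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
rendered = "<html><body>" + "<p>Product description text</p>" * 20 + "</body></html>"

print(looks_client_rendered(shell))
print(looks_client_rendered(rendered))
```

A text-length heuristic is cheap but can misfire on legitimately short pages, so treating it as a hint to try render_mode=playwright, rather than as proof, is the safer reading.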

16.3 test_selector

Before committing a selector to a config, verify it matches what you expect. An agent uses this to validate selectors it infers from the page structure.

result = await test_selector(SSR_PRODUCTS, "article.product-card")
r = json.loads(result)

print(f"Selector: {r['selector']}")
print(f"Matches:  {r['count']}")
print(f"Note:     {r['note']}")
print()
print("Previews:")
for preview in r['previews']:
    print(f"  text: {preview['text'][:60]!r}")

[04/11/26 13:58:40] INFO     HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK"     _client.py:1740

Selector: article.product-card
Matches:  10
Note:     10 elements matched

Previews:
  text: 'LaptopsMacBook Pro 14"$1999.004.8/5(2341 reviews)In StockThe'
  text: 'LaptopsDell XPS 15$1499.004.6/5(1876 reviews)In StockPremium'
  text: 'LaptopsLenovo ThinkPad X1 Carbon Gen 11$1299.004.7/5(1543 re'
  text: 'LaptopsASUS ZenBook Pro Duo 15$1799.004.4/5(876 reviews)Out '
  text: 'SmartphonesiPhone 15 Pro$999.004.9/5(5432 reviews)In StockTi'

# Wrong selector - agent gets immediate feedback to try something else
result = await test_selector(SSR_PRODUCTS, "div.product")
r = json.loads(result)
print(f"Selector '{r['selector']}': {r['note']}")

[04/11/26 13:58:46] INFO     HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK"     _client.py:1740

Selector 'div.product': No matches found
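
The feedback loop described above can be sketched as a function that tries candidate selectors in order and stops at the first one that matches. The tester is passed in as a parameter so the example runs without the demo sites. Both fake_tester and the candidate list are stand-ins invented for illustration; only the selector and count keys come from the real tool output.

```python
import asyncio
import json


async def first_matching_selector(url, candidates, tester):
    """Return (selector, count) for the first candidate that matches, else None."""
    for selector in candidates:
        result = json.loads(await tester(url, selector))
        if result["count"] > 0:
            return selector, result["count"]
    return None


# Stand-in for test_selector so the sketch runs without the demo sites.
async def fake_tester(url, selector):
    counts = {"article.product-card": 10}
    return json.dumps({"selector": selector, "count": counts.get(selector, 0)})


candidates = ["div.product", "li.product", "article.product-card"]

# In a notebook, replace asyncio.run(...) with a plain `await`.
chosen = asyncio.run(
    first_matching_selector("http://localhost:8001/products", candidates, fake_tester)
)
print(chosen)
```

Passing test_selector itself as the tester argument would run the same loop against a live page.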

16.4 compare_rendering

Shows an AI agent (or a human) exactly why CSR sites need Playwright: a static fetch finds 10 product cards on the SSR site and 0 on the CSR version of the same listing.

result = await compare_rendering(SSR_PRODUCTS, CSR_PRODUCTS)
r = json.loads(result)

print("SSR:")
print(f"  Visible text:  {r['ssr']['visible_text_chars']} chars")
print(f"  Cards found:   {r['ssr']['cards_found']}")
print(f"  Scrapeable:    {r['ssr']['scrapeable_without_js']}")

print()
print("CSR:")
print(f"  Visible text:  {r['csr']['visible_text_chars']} chars")
print(f"  Cards found:   {r['csr']['cards_found']}")
print(f"  Scrapeable:    {r['csr']['scrapeable_without_js']}")
print(f"  Note:          {r['csr']['note']}")

print()
print("Verdict:", r['verdict'])

[04/11/26 13:58:51] INFO     HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8002/products "HTTP/1.1 200 OK"     _client.py:1740

SSR:
  Visible text:  2222 chars
  Cards found:   10
  Scrapeable:    True

CSR:
  Visible text:  159 chars
  Cards found:   0
  Scrapeable:    False
  Note:          CSR requires Playwright. Static fetch returns only the empty HTML shell.

Verdict: SSR: direct scraping works. CSR: JavaScript must execute first - use render_mode=playwright in config.

16.5 load_config + run_scrape

For known demo sites, pre-built configs are available. An agent can load one and run a scrape without generating a new config from scratch.

# Load a pre-built config
result = load_config("shopsphere-ssr")
config = json.loads(result)

print(f"Name:        {config['name']}")
print(f"Render mode: {config['render_mode']}")
print(f"Fields:      {list(config['fields'].keys())}")

# Requesting an unknown name returns the list of available configs
result = load_config("nonexistent")
r = json.loads(result)
print(f"Available:   {r['available']}")

Name:        ShopSphere SSR
Render mode: static
Fields:      ['title', 'price', 'rating', 'review_count', 'category', 'in_stock', 'description', 'product_id', 'tags', 'image', 'gallery_images']
Available:   ['shopsphere-ssr', 'jobhive-csr', 'shopsphere-csr', 'jobhive-ssr']

# Run a scrape with the loaded config
result = await run_scrape(json.dumps(config), max_items=3)
r = json.loads(result)

print(f"Records scraped: {r['count']}")
print(f"Render mode:     {r['render_mode']}")
print()
for rec in r['records']:
    print(f"  {rec['title']}  - ${rec['price']} - {rec['category']} - rating: {rec['rating']}")

[04/11/26 13:59:24] INFO     HTTP Request: GET http://localhost:8001/products?page=1 "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8001/products/macbook-pro-14 "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8001/products/dell-xps-15 "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8001/products/lenovo-thinkpad-x1-carbon "HTTP/1.1 200 OK"     _client.py:1740

Records scraped: 3
Render mode:     static

  MacBook Pro 14"  - $1999.00 - Laptops - rating: 4.8
  Dell XPS 15  - $1499.00 - Laptops - rating: 4.6
  Lenovo ThinkPad X1 Carbon Gen 11  - $1299.00 - Laptops - rating: 4.7

16.6 generate_scraper_config

Given only a URL, the tool fetches the listing page and a sample detail page, summarizes both, and asks the LLM to generate a complete config. This is the tool an agent uses when it encounters a new, unknown site.

# Generate a config for JobHive SSR from scratch
result = await generate_scraper_config(SSR_JOBS)
config = json.loads(result)

print(f"render_mode: {config.get('render_mode')}")
print(f"fields: {list(config.get('fields', {}).keys())}")
print()
print("link_selector:", config.get('listing', {}).get('link_selector'))
print("url_template: ", config.get('sources', [{}])[0].get('url_template'))

Fetching listing page http://localhost:8003/jobs

[04/11/26 13:59:31] INFO     HTTP Request: GET http://localhost:8003/jobs "HTTP/1.1 200 OK"         _client.py:1025

Fetching sample detail page http://localhost:8003/jobs/senior-backend-engineer-stripe

                    INFO     HTTP Request: GET http://localhost:8003/jobs/senior-backend-engineer-stripe "HTTP/1.1 200 OK"     _client.py:1025

Detail page summary (45 lines)

Listing page summary (47 lines)

Asking LLM to generate config...

[04/11/26 13:59:33] INFO     HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"     _client.py:1025

render_mode: static
fields: ['title', 'company', 'location', 'salary', 'job_type', 'tags', 'department', 'experience', 'job_id']

link_selector: article.job-card a.job-link
url_template:  http://localhost:8003/jobs?page={n}
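
LLM output can be incomplete, so a cheap structural check before handing a generated config to run_scrape may catch problems early. The sketch below checks only the keys that appear in this notebook (render_mode, fields, listing.link_selector, sources[].url_template). The function validate_config is hypothetical, the set of allowed render modes is an assumption based on the two values seen here, and the shape of each field entry is invented for the example.

```python
def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    # Assumption: "static" and "playwright" are the only valid render modes.
    if config.get("render_mode") not in ("static", "playwright"):
        problems.append(f"unexpected render_mode: {config.get('render_mode')!r}")
    fields = config.get("fields")
    if not isinstance(fields, dict) or not fields:
        problems.append("fields must be a non-empty object")
    if not (config.get("listing") or {}).get("link_selector"):
        problems.append("listing.link_selector is missing")
    sources = config.get("sources") or []
    if not sources:
        problems.append("sources is empty")
    for i, source in enumerate(sources):
        if not source.get("url_template"):
            problems.append(f"sources[{i}].url_template is missing")
    return problems


good = {
    "render_mode": "static",
    "fields": {"title": "h1"},  # field value shape is illustrative only
    "listing": {"link_selector": "article.job-card a.job-link"},
    "sources": [{"url_template": "http://localhost:8003/jobs?page={n}"}],
}
bad = {"render_mode": "browser", "fields": {}}

print(validate_config(good))
for problem in validate_config(bad):
    print("-", problem)
```

Returning a list of problems instead of raising on the first one lets an agent feed all of them back to the LLM in a single repair prompt.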

# Run a scrape with the generated config
result = await run_scrape(json.dumps(config), max_items=3)
r = json.loads(result)

print(f"Records scraped: {r['count']}")
print()
for rec in r['records']:
    title   = rec.get('title', 'N/A')
    company = rec.get('company', 'N/A')
    salary  = rec.get('salary', 'N/A')
    print(f"  {title} @ {company} - {salary}")

[04/11/26 13:59:56] INFO     HTTP Request: GET http://localhost:8003/jobs?page=1 "HTTP/1.1 200 OK"  _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8003/jobs/senior-backend-engineer-stripe "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8003/jobs/staff-software-engineer-airbnb "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8003/jobs/senior-frontend-engineer-figma "HTTP/1.1 200 OK"     _client.py:1740

Records scraped: 3

  Senior Backend Engineer @ Stripe - $180,000 - $240,000
  Staff Software Engineer @ Airbnb - $200,000 - $280,000
  Senior Frontend Engineer @ Figma - $160,000 - $220,000
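
The url_template in the generated config carries a {n} placeholder, and the request log shows the scraper asking for ?page=1 first. A minimal sketch of how such a template could expand into page URLs follows. It assumes {n} is a 1-based page number and that plain string substitution is enough; the scraper's actual pagination logic is not shown in this notebook, and page_urls is a hypothetical helper.

```python
def page_urls(url_template: str, first_page: int = 1, last_page: int = 3) -> list:
    """Expand a template containing {n} into one URL per page number."""
    # str.replace is used instead of str.format so other braces in the URL are left alone.
    return [url_template.replace("{n}", str(n)) for n in range(first_page, last_page + 1)]


urls = page_urls("http://localhost:8003/jobs?page={n}", last_page=3)
for url in urls:
    print(url)
```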

16.7 The Full Autonomous Workflow

Putting it together: an AI agent receives a URL and returns structured data without any human intervention.

async def autonomous_scrape(url: str, max_items: int = 5) -> list:
    """
    Full MCP tool chain:
      1. fetch_page_structure  → understand the page
      2. generate_scraper_config → LLM writes the config
      3. run_scrape            → execute and return records
    """
    print(f"Step 1: analysing {url}")
    structure = json.loads(await fetch_page_structure(url))
    print(f"  is_csr={structure['is_csr']}, text_len={structure['visible_text_length']}")

    print("Step 2: generating config via LLM...")
    config = json.loads(await generate_scraper_config(url))
    print(f"  render_mode={config.get('render_mode')}, fields={list(config.get('fields',{}).keys())}")

    print(f"Step 3: scraping {max_items} items...")
    output = json.loads(await run_scrape(json.dumps(config), max_items=max_items))
    print(f"  scraped {output['count']} records")

    return output['records']


records = await autonomous_scrape(SSR_PRODUCTS, max_items=3)
print()
print("Results:")
for rec in records:
    print(f"  {rec.get('title')} - {rec.get('price')} - {rec.get('category')}")

Step 1: analysing http://localhost:8001/products

[04/11/26 14:01:02] INFO     HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK"     _client.py:1740

  is_csr=False, text_len=2222
Step 2: generating config via LLM...

Fetching listing page http://localhost:8001/products

                    INFO     HTTP Request: GET http://localhost:8001/products "HTTP/1.1 200 OK"     _client.py:1025

Fetching sample detail page http://localhost:8001/products/macbook-pro-14

                    INFO     HTTP Request: GET http://localhost:8001/products/macbook-pro-14 "HTTP/1.1 200 OK"     _client.py:1025

Detail page summary (45 lines)

Listing page summary (41 lines)

Asking LLM to generate config...

[04/11/26 14:01:03] INFO     HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"     _client.py:1025

[04/11/26 14:01:06] INFO     HTTP Request: GET http://localhost:8001/products?page=1 "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8001/products/macbook-pro-14 "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8001/products/dell-xps-15 "HTTP/1.1 200 OK"     _client.py:1740

                    INFO     HTTP Request: GET http://localhost:8001/products/lenovo-thinkpad-x1-carbon "HTTP/1.1 200 OK"     _client.py:1740

  render_mode=static, fields=['title', 'price', 'rating', 'category', 'in_stock', 'image', 'description', 'tags', 'specs']
Step 3: scraping 3 items...

  scraped 3 records

Results:
  MacBook Pro 14" - $1999.00 - Laptops
  Dell XPS 15 - $1499.00 - Laptops
  Lenovo ThinkPad X1 Carbon Gen 11 - $1299.00 - Laptops

16.8 Connecting to Claude Code

To use these tools from Claude Code, register the MCP server in your Claude Code settings. Claude Code will start the server automatically and make the tools available in every conversation.

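Add the following to a project-level .mcp.json file (Claude Desktop reads the same mcpServers format from its claude_desktop_config.json), or register the server with claude mcp add: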

{
  "mcpServers": {
    "scraper": {
      "command": "/path/to/scrapping/.venv/bin/python",
      "args": ["/path/to/scrapping/agents/mcp_server.py"],
      "env": {
        "LLM_BACKEND": "openrouter",
        "OPENROUTER_API_KEY": "sk-or-v1-..."
      }
    }
  }
}
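
The JSON entry above can also be generated rather than typed by hand, which keeps the API key out of any file you might commit. The sketch below is one way to do that, assuming the .venv and agents layout from the paths shown above. The helper mcp_server_entry is hypothetical, and reading the key from the environment is a design suggestion rather than something the project requires.

```python
import json
import os
from pathlib import PurePosixPath


def mcp_server_entry(project_root: str, env: dict) -> dict:
    """Build the mcpServers block for the scraper from a project root path."""
    root = PurePosixPath(project_root)
    return {
        "mcpServers": {
            "scraper": {
                "command": str(root / ".venv" / "bin" / "python"),
                "args": [str(root / "agents" / "mcp_server.py")],
                "env": env,
            }
        }
    }


entry = mcp_server_entry(
    "/path/to/scrapping",
    {
        "LLM_BACKEND": "openrouter",
        # Read the key from the environment instead of hard-coding it.
        "OPENROUTER_API_KEY": os.environ.get("OPENROUTER_API_KEY", ""),
    },
)
print(json.dumps(entry, indent=2))
```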

Or add via CLI:

claude mcp add scraper \
  /path/to/scrapping/.venv/bin/python \
  /path/to/scrapping/agents/mcp_server.py

Once connected, you can ask Claude:

“Scrape the first 10 products from http://localhost:8001/products and give me a table of names, prices, and ratings.”

Claude will call fetch_page_structure → generate_scraper_config → run_scrape autonomously and return the structured data.

16.9 Key Takeaways

MCP turns scraping tools into AI-callable operations: no copy-pasting HTML, no manual config writing
Each tool has a single, well-defined job: structure analysis, selector testing, config generation, execution
Tools return JSON strings: a universal format any LLM can parse and act on
The autonomous workflow is 3 tool calls: fetch_page_structure → generate_scraper_config → run_scrape
The same tool functions work in notebooks: no MCP server process needed for testing