Chapter 13: Config as Language for AI Agents

The preceding twelve chapters built a complete scraping system: config-driven extraction, static and headless fetching, pagination, scheduling, dual-database storage. Every component works. Every component was built by humans who inspected sites, wrote selectors, and authored JSON configs.

This chapter asks a different question: what if the human is optional?

The Productivity Ceiling

Traditional scraping infrastructure has a productivity ceiling defined by human time. Someone must inspect each target site, identify selectors, write the config, test it, deploy it, and maintain it when the site changes. For a team running 50 scrapers, that is a continuous human workload.

The ceiling is most visible in long-tail scraping: hundreds of small data sources that each individually justify a scraper but collectively exceed what any team can maintain by hand. Market intelligence platforms, alternative data providers, and research data services all run into this ceiling. The usual response is to hire more people, which raises the ceiling but scales poorly.

The JSON config approach was designed with this ceiling in mind. The reason the config exists as data rather than code is precisely so that machines can generate and modify it. The config is the interface between human intent and machine execution, but the human does not have to be the one writing it.

What LLMs Can Do with Configs

Large language models have three capabilities relevant to scraper configs:

Generation. Given an HTML page structure and a list of fields to extract, an LLM can produce a valid JSON config. The model has seen thousands of examples of CSS selectors, HTML structures, and data extraction patterns. It can look at an unfamiliar page and infer which elements contain which data.

Repair. Given a config and error feedback (“selector h1.product-title matched nothing on this page”), an LLM can examine the HTML, identify what changed, and propose a corrected selector. This is the same reasoning a human engineer would apply to a broken scraper, but faster and available at 3am.

Validation. Given a config and scraped data, an LLM can evaluate whether the output looks correct. “The salary_min field contains $180,000 and salary_max contains $240,000; these look like correct salary values for a senior engineering role.” This is data quality validation using the model’s knowledge of what the data should look like.

The Structure Advantage

JSON with a defined schema is a better AI output format than Python code for several reasons.

Sandboxability. A JSON config cannot execute arbitrary system calls. It describes data extraction in a constrained vocabulary: selectors, retrieve methods, pagination parameters. The engine executes the config in a controlled way. Python code generated by an LLM might import subprocess and run shell commands.

Validatability. Before the engine runs a config, it can validate the schema: are all required fields present? Are the retrieve values from the allowed set? Are there any obviously wrong selectors (the CSS is syntactically invalid)? Schema validation catches a large class of errors that would otherwise produce silent failures.

Inspectability. A non-engineer can read a JSON config and understand what it does. They can compare two versions and see what changed. They can approve changes in a code review without understanding Python. This opens the config authoring workflow to a broader audience.

Diffability. Configs change when sites change. A diff between two config versions is human-readable: “the price selector changed from .product-price to .price-amount.” Diffs in Python code are noisier.
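The validatability point above can be made concrete with a small structural check. The following is a minimal sketch, not the engine's actual validator: the four top-level keys and the two render_mode values come from the schema line used later in this chapter, while the per-field shape ({"selector": ..., "retrieve": ...}) and the ALLOWED_RETRIEVE values are assumptions made for illustration.

```python
# Only the top-level keys and render_mode values are taken from the chapter.
# ALLOWED_RETRIEVE and the per-field shape are hypothetical.
ALLOWED_RENDER_MODES = {"static", "playwright"}
ALLOWED_RETRIEVE = {"text", "attr", "html"}
REQUIRED_KEYS = {"render_mode": str, "sources": list, "listing": dict, "fields": dict}


def validate_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config passed."""
    errors = []

    # Every required key must be present and have the expected JSON type.
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            errors.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected_type):
            errors.append(f"{key} must be of type {expected_type.__name__}")

    # Enumerated values must come from the allowed set.
    mode = config.get("render_mode")
    if isinstance(mode, str) and mode not in ALLOWED_RENDER_MODES:
        errors.append(f"render_mode {mode!r} is not one of {sorted(ALLOWED_RENDER_MODES)}")

    # Each field needs a non-empty selector and a known retrieve method.
    fields = config.get("fields")
    if isinstance(fields, dict):
        for name, spec in fields.items():
            if not isinstance(spec, dict):
                errors.append(f"field {name!r} must be an object")
                continue
            selector = spec.get("selector")
            if not isinstance(selector, str) or not selector.strip():
                errors.append(f"field {name!r} has no selector")
            retrieve = spec.get("retrieve", "text")
            if retrieve not in ALLOWED_RETRIEVE:
                errors.append(f"field {name!r} has unknown retrieve value {retrieve!r}")

    return errors
```

Returning a list of messages rather than raising on the first problem is a deliberate choice here: the whole list can be handed back to the LLM as repair feedback in one round trip.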

Building the Config Generation Pipeline

The autonomous config generation pipeline has three stages:

Stage 1: Page structure extraction. Fetch the target URL and produce a compact structural summary: the significant HTML elements with their class names, data attributes, and text previews. The full HTML of a typical page is often too large, and too noisy, to use directly in an LLM context window; a structural summary is usually sufficient for selector inference.

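from bs4 import BeautifulSoup  # requires the beautifulsoup4 and lxml packages


def summarise_html(html: str, max_elements: int = 60) -> str:
    """Return an indented outline of the page's distinct tag/class combinations."""
    soup = BeautifulSoup(html, "lxml")
    # Drop elements that carry no extractable content.
    for tag in soup(["script", "style", "meta", "link"]):
        tag.decompose()

    lines = []
    seen = set()  # tag/class signatures already listed

    def walk(el, depth=0):
        if len(lines) >= max_elements:
            return
        classes = " ".join(el.get("class", []))
        tag_id = f"{el.name}.{classes}"
        # List each signature once, so repeated listing cards yield one line.
        if tag_id not in seen:
            seen.add(tag_id)
            text = el.get_text(strip=True)[:60]
            lines.append(f"{'  ' * depth}<{el.name} class='{classes}'> {text!r}")
        # Recurse into element children only; text nodes are skipped.
        for child in el.children:
            if hasattr(child, "name") and child.name:
                walk(child, depth + 1)

    walk(soup.body or soup)
    return "\n".join(lines)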

Stage 2: LLM config generation. Pass the structural summary to the LLM with a system prompt that specifies the config schema and generation rules:

import json

SYSTEM_PROMPT = """You are an expert web scraping engineer.
Given an HTML page structure, produce a valid scraper config JSON.

Schema:
{ "render_mode": "static"|"playwright", "sources": [...], "listing": {...}, "fields": {...} }

Rules:
- Output ONLY valid JSON, no explanation.
- Use the most specific CSS selectors visible in the HTML.
- If page appears empty (CSR), set render_mode to playwright.
- Include fields: all visible data fields relevant to the site type.
"""

response = llm.chat([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"URL: {url}\n\nHTML structure:\n{summary}"},
])
config = json.loads(response)
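The json.loads call above assumes the reply is bare JSON. Models sometimes wrap their output in a Markdown code fence even when told not to, so a tolerant parser is a cheap safeguard. This is a minimal sketch; parse_config_response is a hypothetical helper, not part of the book's code.

```python
import json
import re

# Build the fence marker programmatically to keep this listing easy to embed.
_TICKS = "`" * 3
_FENCE = re.compile(rf"^{_TICKS}[a-zA-Z0-9]*\s*\n(.*?)\n?{_TICKS}\s*$", re.DOTALL)


def parse_config_response(text: str) -> dict:
    """Parse an LLM reply into a config dict, tolerating one surrounding code fence."""
    cleaned = text.strip()
    match = _FENCE.match(cleaned)
    if match:
        cleaned = match.group(1)  # keep only the fenced body
    config = json.loads(cleaned)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(config, dict):
        raise ValueError(f"expected a JSON object, got {type(config).__name__}")
    return config
```

Any ValueError raised here is a natural input for the repair step: the error message can be sent back to the model along with its previous reply.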

Stage 3: Validation and execution. Validate the generated config structurally (schema check) and functionally (run against a single page, verify fields are populated). If validation fails, pass the error back to the LLM for repair.

The Repair Loop

Config generation rarely produces a perfect result on the first attempt. Sites have idiosyncratic structures; class names are not always descriptive; the structural summary may omit a critical element. The repair loop handles these failures:

async def generate_with_repair(url: str, max_retries: int = 3) -> dict:
    # generate_config, test_config, and repair_config are the pipeline's own helpers.
    config = await generate_config(url)

    # One initial test plus up to max_retries repairs, each followed by a re-test.
    for attempt in range(max_retries + 1):
        results = await test_config(config, url, max_items=3)

        if not results:
            error = "No items scraped. The link_selector may be wrong."
        else:
            empty_fields = [k for k, v in results[0].items() if v is None]
            if not empty_fields:
                return config  # Success
            error = f"These fields are empty: {empty_fields}"

        if attempt == max_retries:
            break  # Out of retries; do not return an untested repair

        # Ask LLM to fix
        config = await repair_config(config, url, error)

    return config  # Best effort: the last config tested, which still had errors

The repair loop is the difference between a demo and a production system. A demo generates one config and shows it. A production system iterates until the output meets a quality threshold.
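The loop calls repair_config without showing what is sent to the model. One plausible shape for that request is sketched below. build_repair_messages and REPAIR_PROMPT are hypothetical names, and the extra summary argument (the structural summary from Stage 1) is an assumption about what a repair call would need.

```python
import json

# Hypothetical system prompt for the repair step.
REPAIR_PROMPT = (
    "You are repairing a scraper config. Return ONLY the corrected JSON config. "
    "Change as little as possible to fix the reported error."
)


def build_repair_messages(config: dict, url: str, error: str, summary: str) -> list[dict]:
    """Assemble chat messages carrying the failing config, the error, and the page outline."""
    user = (
        f"URL: {url}\n\n"
        f"Current config:\n{json.dumps(config, indent=2)}\n\n"
        f"Error:\n{error}\n\n"
        f"HTML structure:\n{summary}"
    )
    return [
        {"role": "system", "content": REPAIR_PROMPT},
        {"role": "user", "content": user},
    ]
```

Sending the current config alongside the error, rather than regenerating from scratch, keeps the repaired config close to the previous version, which in turn keeps diffs between versions small.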

AI as Config Author vs. AI as Config Executor

It is important to distinguish two roles AI can play in a scraping system:

Config author. The AI generates and maintains configs. Humans review and approve them. The execution engine is unchanged: it runs whatever valid config it is given. This is the lower-risk integration: the AI’s output is data (JSON), not code; it is human-reviewable; it runs through the existing execution infrastructure.

Autonomous executor. The AI decides what to scrape, generates the config, runs the scrape, evaluates the results, and stores them without human involvement in the loop. This is higher-risk: a bug in the generation pipeline could produce garbage configs that waste compute, or worse, generate incorrect data that flows downstream unchecked.

Both roles are valid. The right choice depends on whether the downstream consumer of the data needs human-verified quality guarantees.

The Memory Problem

An autonomous scraping agent needs to remember what it has already figured out. If it regenerates a config from scratch every time it visits a site, it wastes LLM calls and produces inconsistent configs.

The solution is a config registry: a database mapping site URLs to their last successful configs. When the agent is asked to scrape a known URL, it loads the existing config rather than generating a new one. When the config fails (because the site changed), it triggers the repair loop.

The config registry is exactly what the bylgja configs table implements: configs are versioned, associated with projects, and flagged as active or inactive. The scraper engine loads the active config; when a config is updated, the new version becomes active.
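To make the registry idea concrete, here is a minimal sketch using SQLite from the Python standard library. The table layout, column names, and function names are assumptions for illustration; the actual bylgja configs table is not reproduced here and may differ.

```python
import json
import sqlite3


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a registry database with a hypothetical configs table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS configs (
               id INTEGER PRIMARY KEY,
               site_url TEXT NOT NULL,
               version INTEGER NOT NULL,
               is_active INTEGER NOT NULL DEFAULT 0,
               config_json TEXT NOT NULL,
               UNIQUE (site_url, version)
           )"""
    )
    return conn


def save_config(conn: sqlite3.Connection, site_url: str, config: dict) -> int:
    """Store a new version for site_url and make it the only active one."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM configs WHERE site_url = ?",
            (site_url,),
        ).fetchone()
        version = row[0] + 1
        conn.execute("UPDATE configs SET is_active = 0 WHERE site_url = ?", (site_url,))
        conn.execute(
            "INSERT INTO configs (site_url, version, is_active, config_json) "
            "VALUES (?, ?, 1, ?)",
            (site_url, version, json.dumps(config)),
        )
    return version


def load_active_config(conn: sqlite3.Connection, site_url: str):
    """Return the active config for site_url, or None if the site is unknown."""
    row = conn.execute(
        "SELECT config_json FROM configs WHERE site_url = ? AND is_active = 1",
        (site_url,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Old versions are deactivated rather than deleted, so a repair that makes things worse can be rolled back by reactivating the previous row.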

Indexing into Autonomous Systems

The JSON config is not just a local tool. It is a protocol. Any system that understands the bylgja config schema can:

  • Submit scraping tasks: “scrape https://example.com/jobs and extract title, company, salary, location”
  • Receive results: structured JSON records in a defined format
  • Monitor quality: compare results to expected schemas, alert on empty fields or unexpected values
  • Trigger repairs: detect when a config stops producing results and initiate the repair loop

This is the foundation of autonomous data infrastructure: instead of engineers maintaining scraper code, AI agents maintain scraper configs against a shared execution engine. Engineers define what data they need; agents figure out how to get it.
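The "monitor quality" and "trigger repairs" items in the list above can be sketched as a fill-rate check over a batch of scraped records. The function names and the 20% threshold are assumptions chosen for illustration, not values from the bylgja codebase.

```python
def _is_empty(value) -> bool:
    """Treat None and blank strings as empty; anything else counts as populated."""
    return value is None or (isinstance(value, str) and not value.strip())


def empty_field_rates(records: list[dict], expected_fields: list[str]) -> dict[str, float]:
    """Fraction of records in which each expected field is missing, None, or blank."""
    if not records:
        # No records at all means every field is effectively empty.
        return {name: 1.0 for name in expected_fields}
    return {
        name: sum(1 for r in records if _is_empty(r.get(name))) / len(records)
        for name in expected_fields
    }


def fields_needing_repair(
    records: list[dict], expected_fields: list[str], max_empty_rate: float = 0.2
) -> list[str]:
    """Return the fields whose empty rate exceeds the (assumed) threshold."""
    rates = empty_field_rates(records, expected_fields)
    return [name for name, rate in rates.items() if rate > max_empty_rate]
```

A non-empty result from fields_needing_repair is a reasonable trigger for the repair loop, and the field names double as the error message passed to the model.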

The MCP integration in Chapter 15 exposes this infrastructure as tools that any AI agent (including Claude Code, Claude Desktop, or any MCP-compatible system) can call. The tools are generate_scraper_config, run_scrape, test_selector, and compare_rendering. The agent calls these tools to answer data questions autonomously.

Choosing an LLM Backend

The agent code in this book abstracts the LLM call behind a single chat() function in agents/llm.py. Two backends are supported and switchable with one environment variable:

# Use OpenRouter (cloud, any model via a single API key)
LLM_BACKEND=openrouter

# Use Ollama (local or self-hosted, fully private)
LLM_BACKEND=ollama

OpenRouter routes requests to hosted models (Mistral, GPT-4o, Claude, Llama, and hundreds of others) through one unified API. It is the practical default: no GPU required, no model management, and you can swap models by changing one env var.

Ollama runs open-weight models locally or on a machine you control. The Ollama server exposes an OpenAI-compatible endpoint, so the same SDK code works unchanged; only base_url and api_key differ:

# OpenRouter
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Ollama: api_key is required by the SDK but not validated by Ollama.
# The host below is an example LAN address; point it at your own Ollama server.
client = OpenAI(
    base_url="http://192.168.0.177:11434/v1",
    api_key="ollama",
)

The full backend selection logic in agents/llm.py:

import os

from openai import OpenAI


def get_client() -> OpenAI:
    backend = os.environ.get("LLM_BACKEND", "openrouter")
    if backend == "ollama":
        return OpenAI(
            base_url=os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
            api_key="ollama",
        )
    return OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

def get_model() -> str:
    if os.environ.get("LLM_BACKEND") == "ollama":
        return os.environ.get("OLLAMA_MODEL", "ministral-3:14b")
    return os.environ.get("OPENROUTER_LLM_MODEL", "mistralai/ministral-14b-2512")
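The chat() wrapper itself is not reproduced above. A minimal sketch of what such a wrapper might look like follows. It is an assumption, not the code in agents/llm.py: the real function evidently takes only the message list, whereas this version accepts the client and model as parameters so that it runs without the SDK installed and can be tested with a stand-in client.

```python
def chat(client, model: str, messages: list[dict], temperature: float = 0.0) -> str:
    """Send messages through an OpenAI-compatible client and return the reply text.

    `client` is expected to expose the OpenAI SDK's chat.completions.create
    method, which both the OpenRouter and Ollama clients above provide.
    """
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,  # low temperature favours repeatable configs
    )
    # content can be None for some responses; normalise to an empty string.
    return response.choices[0].message.content or ""
```

In the layout described in this chapter, a call would look like chat(get_client(), get_model(), messages), with the returned string passed on to JSON parsing.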

Configure both in .env:

LLM_BACKEND=openrouter          # or: ollama

OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_LLM_MODEL=mistralai/ministral-14b-2512

OLLAMA_BASE_URL=http://192.168.0.177:11434/v1
OLLAMA_MODEL=ministral-3:14b

Backend comparison

Both backends produced identical scraper configs in testing against the demo sites:

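Backend    | Model                        | Config quality                                         | Latency | Cost
-----------|------------------------------|--------------------------------------------------------|---------|-------------------
OpenRouter | mistralai/ministral-14b-2512 | render_mode: static yes, correct detail-page selectors | ~2s     | per token
Ollama     | ministral-3:14b              | render_mode: static yes, correct detail-page selectors | ~3s     | free (self-hosted)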

Both returned valid, schema-clean JSON with no rogue keys, correct link_selector, and field selectors targeting the detail page rather than the listing cards. The autonomous scraper (generate config, scrape, LLM validate) completed successfully with both backends, producing 3 records with all critical fields populated.

The choice between backends is operational, not functional: use OpenRouter for zero-maintenance access to the latest models; use Ollama when data privacy, offline capability, or cost at scale is the priority.

Apply This

1. Generate configs, not code. If you are building an AI-assisted scraping pipeline, output JSON configs rather than Python. The security, reviewability, and auditability advantages are significant.

2. Include the repair loop in any production AI pipeline. First-attempt config quality is inconsistent. A repair loop with 2-3 iterations typically recovers many of those first-attempt failures, at the cost of a few extra LLM calls per site.

3. Rate-limit your LLM calls, not just your scraper. Config generation is expensive. Cache generated configs aggressively. Regenerate only when a config starts failing, not on every scrape run.

4. Use structured output APIs when available. Many LLM APIs support JSON Schema-constrained output, where the response is constrained to valid JSON matching your schema. This removes most JSON parsing errors in generated configs, although a config can still be schema-valid and contain the wrong selectors, so functional testing remains necessary.

5. Keep humans in the loop for high-stakes data. For data that feeds business decisions, require human review of AI-generated configs before they go to production. The AI dramatically reduces the work of authoring configs; the human provides the quality gate.