Chapter 1: Why Scraping Exists

The web holds more useful data than any single database. Product prices, job postings, real estate listings, academic citations, court records, sports statistics: all of it lives in HTML, behind URLs, accessible to any HTTP client. The gap between “it’s on the web” and “I can use it programmatically” is the business case for web scraping.

The Data Gap

Every large company has internal data infrastructure: APIs, databases, data warehouses, pipelines. This infrastructure is expensive to build and expensive to expose. When a company publishes a website, it is publishing data to humans. Giving that same data to machines is a separate decision, one that many companies never make.

The result is a permanent asymmetry: the human-readable web is rich; the machine-readable web is sparse. The companies that build APIs tend to be platforms that benefit from developer ecosystems. The companies that do not build APIs are often the most interesting data sources: niche marketplaces, local directories, government databases, specialized job boards, procurement systems.

Scraping is how you extract data from the non-API part of the internet.

What Scraping Actually Is

Scraping is structured extraction from unstructured (or semi-structured) sources. “Unstructured” here means HTML intended for human consumption, not machine parsing. That HTML still has implicit structure: on a given site, the product name might always sit in an h1 and the price in a span with class price-amount. Scraping is the act of making that implicit structure explicit.
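As a minimal sketch of making implicit structure explicit, the snippet below pulls a name and a price out of a small page using only Python's standard-library html.parser. The sample HTML and the ProductParser class are hypothetical, invented for illustration; a real scraper would more likely use a dedicated selector library.

```python
from html.parser import HTMLParser

# Hypothetical product page; the tag and class names follow the example in the text.
SAMPLE_HTML = """
<html><body>
  <h1>Trail Running Shoe</h1>
  <div class="buy-box"><span class="price-amount">$89.99</span></div>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the text of the <h1> and of the span with class price-amount."""

    def __init__(self):
        super().__init__()
        self.record = {"name": "", "price": ""}
        self._field = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1":
            self._field = "name"
        elif tag == "span" and "price-amount" in classes:
            self._field = "price"

    def handle_endtag(self, tag):
        # Simplification: assumes no nested tags inside the h1 or the price span.
        if tag in ("h1", "span"):
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.record[self._field] += data.strip()


parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.record)  # {'name': 'Trail Running Shoe', 'price': '$89.99'}
```

The page never declares "this is the name" or "this is the price". The parser encodes that knowledge, which is exactly the part that breaks when the site changes its markup.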

The naive mental model of scraping is “download HTML, grep for values.” That model breaks immediately when you encounter:

  • JavaScript-rendered content (the HTML sent by the server contains little or none of the data)
  • Pagination (the data is spread across hundreds of pages)
  • Dynamic elements (prices that change based on location or login state)
  • Anti-bot measures (rate limiting, CAPTCHAs, fingerprinting)

Serious scraping requires a systematic approach to each of these problems. This book provides that approach.

What Scraping Is Not

Scraping is not crawling. A crawler visits URLs to discover more URLs, building a graph of the web. Scraping visits URLs to extract specific data points from each. These activities often combine: you crawl a marketplace to find product URLs, then scrape each product URL for price and specs. But the concepts are distinct.
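The split between the two activities can be sketched with two small functions. Everything here is an assumption made for illustration: the PAGES dictionary stands in for real HTTP responses, and regular expressions are used only to keep the example short, not as a recommended way to parse HTML.

```python
import re

# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "/catalog": '<a href="/p/1">One</a> <a href="/p/2">Two</a> <a href="/about">About</a>',
    "/p/1": "<h1>Widget</h1><span class='price-amount'>$5.00</span>",
    "/p/2": "<h1>Gadget</h1><span class='price-amount'>$7.50</span>",
    "/about": "<h1>About us</h1>",
}


def fetch(url):
    """Stand-in for an HTTP GET."""
    return PAGES[url]


def crawl(start_url):
    """Crawling: visit a page to discover more URLs (here, product URLs only)."""
    links = re.findall(r'href="([^"]+)"', fetch(start_url))
    return [u for u in links if u.startswith("/p/")]


def scrape(url):
    """Scraping: visit one URL and extract specific data points from it."""
    html = fetch(url)
    name = re.search(r"<h1>(.*?)</h1>", html).group(1)
    price = re.search(r"price-amount'>\$([\d.]+)<", html).group(1)
    return {"url": url, "name": name, "price": float(price)}


product_urls = crawl("/catalog")                # crawler output ...
records = [scrape(u) for u in product_urls]     # ... is the scraper input
print(records)
```

Keeping the two functions separate means the URL list can be stored, inspected, and re-scraped without re-crawling.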

Scraping is not parsing. Parsing is a low-level operation: turning a string into a structured object. Scraping orchestrates parsing across many pages, with retry logic, rate limiting, storage, and scheduling.
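One way to picture the difference is an orchestration loop wrapped around a parse function. The sketch below is illustrative only: scrape_all, flaky_fetch, and parse are hypothetical names, the fetcher is a stub that fails on purpose, and storage and scheduling are left out.

```python
import time


def scrape_all(urls, fetch, parse, retries=3, delay=1.0, sleep=time.sleep):
    """Run parse(fetch(url)) for each URL with retries and a pause between URLs."""
    results, failures = [], []
    for url in urls:
        for attempt in range(1, retries + 1):
            try:
                results.append(parse(fetch(url)))
                break
            except Exception as exc:  # broad on purpose for a sketch
                if attempt == retries:
                    failures.append((url, repr(exc)))
                else:
                    sleep(delay * attempt)  # simple linear backoff
        sleep(delay)  # rate limit between URLs
    return results, failures


# Hypothetical stubs: "/a" fails once and then succeeds, "/bad" always fails.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if url == "/a" and calls["n"] == 1:
        raise ConnectionError("temporary failure")
    if url == "/bad":
        raise ConnectionError("always down")
    return "<h1>" + url + "</h1>"


def parse(html):
    """The parsing step: string in, structured object out."""
    return {"title": html[len("<h1>"):-len("</h1>")]}


results, failures = scrape_all(
    ["/a", "/bad"], flaky_fetch, parse, retries=2, delay=0.0, sleep=lambda s: None
)
print(results)
print(failures)
```

The parse function is three lines. Everything else in the block is the scraping part.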

Scraping is not a data pipeline, though scrapers often feed data pipelines. The scraper collects; the pipeline transforms, stores, and serves.

Use Cases

The range of legitimate scraping use cases is vast:

Price intelligence. E-commerce companies track competitor pricing. A retailer with 50,000 SKUs needs daily competitor prices to stay competitive. Competitors rarely offer a pricing API, so scraping is often the only option.

Job market analysis. Economists, researchers, and job seekers track which skills are in demand, which companies are hiring, and how salaries are shifting. This data lives in job boards, and few job boards offer research APIs.

Lead generation. Sales teams build lists of target companies from directories, review sites, and LinkedIn. Every lead list starts with some form of structured extraction from a web source.

Content aggregation. News aggregators, travel comparison sites, and real estate portals aggregate listings from many sources. They are, at their core, scrapers with a consumer-facing interface on top.

Academic research. Social scientists study online behavior, economists analyze markets, and public health researchers track disease mentions. Their data sources are websites.

Monitoring and alerting. Track when a product comes back in stock. Alert when a competitor changes a key page. Detect when a government site publishes a new document. All of these require repeatedly fetching and comparing pages.

Government data liberation. Government websites publish public records that have not been converted to machine-readable formats. FOIA researchers, investigative journalists, and civic technologists scrape to make this data accessible.
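The monitoring and alerting use case above comes down to fetching a page repeatedly and comparing it with the last version seen. A minimal sketch of the comparison step follows. The function names and the in-memory seen dictionary are assumptions for illustration; a real system would persist the fingerprints and would usually hash only the extracted fields, not the whole page.

```python
import hashlib


def fingerprint(text):
    """Stable hash of the content we care about, with whitespace normalized."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_change(url, new_text, seen):
    """Return True if new_text differs from the last fingerprint stored for url."""
    new_hash = fingerprint(new_text)
    changed = url in seen and seen[url] != new_hash
    seen[url] = new_hash
    return changed


seen = {}
first = detect_change("/item", "Out of stock", seen)       # first sighting
second = detect_change("/item", "Out  of stock\n", seen)   # whitespace-only difference
third = detect_change("/item", "In stock", seen)           # real change
print(first, second, third)  # False False True
```

Normalizing before hashing keeps cosmetic markup changes from firing alerts.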

The Architecture of This Book

This book is organized around the problems you encounter when building a production scraper:

Part 1 (Foundations) establishes the core concepts: how HTTP works, what rendering means, and why a JSON-config approach to scraping is worth learning.

Part 2 (Extraction) goes deep on selectors and field extraction: how to target the right elements, how to handle edge cases, how to extract structured data from tables and lists.

Part 3 (Rendering) covers static fetching, Playwright for JavaScript-rendered pages, and the auto-fallback pattern that tries static first and escalates to headless only when needed.

Part 4 (Scale) covers scheduling, storage, and monitoring: running scrapers on a schedule, persisting results to databases, and alerting when things go wrong or data changes.

Part 5 (AI Agents) connects the JSON-config approach to AI: how LLMs can generate configs, how autonomous agents can discover and scrape sites without human configuration, and how MCP exposes scraping as a tool for AI systems.

The Central Thesis

The JSON-config approach to scraping makes a bet: that a declarative specification of extraction logic is more valuable than the imperative code that implements it. A config says “extract the price from the element with class price-amount, applying a regexp to strip the dollar sign.” Code says the same thing in twenty lines of Python. The config is shorter, easier to read, and something an AI can generate.
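To make the contrast concrete, here is a toy config and a toy engine that executes it. The config format (fields, selector, regex) is invented for this sketch and is not the format the book's engine uses. The engine supports only bare tag names and single class selectors.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical config format, invented for this sketch.
CONFIG = json.loads("""
{
  "fields": {
    "name":  {"selector": "h1"},
    "price": {"selector": ".price-amount", "regex": "[0-9.]+"}
  }
}
""")


class FirstMatchText(HTMLParser):
    """Captures the text of the first element matching 'tag' or '.class'.

    Limitation: unclosed void tags such as <br> inside the match are not handled.
    """

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.depth = 0      # greater than 0 while inside the matched element
        self.done = False
        self.text = ""

    def _matches(self, tag, attrs):
        if self.selector.startswith("."):
            classes = (dict(attrs).get("class") or "").split()
            return self.selector[1:] in classes
        return tag == self.selector

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.done and self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.text += data


def extract(html, config):
    """Apply every field rule in the config to the page and build one record."""
    record = {}
    for field, rule in config["fields"].items():
        parser = FirstMatchText(rule["selector"])
        parser.feed(html)
        value = parser.text.strip()
        if "regex" in rule:
            match = re.search(rule["regex"], value)
            value = match.group(0) if match else None
        record[field] = value
    return record


html = '<h1>Trail Running Shoe</h1><span class="price-amount">$89.99</span>'
print(extract(html, CONFIG))  # {'name': 'Trail Running Shoe', 'price': '89.99'}
```

The engine is written once. Pointing it at a different site means writing a different six-line config, which is the part a person or a model has to produce.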

When you can generate the config from the site structure, you have autonomous scraping. Not scraping that is automated (that is just scheduling), but scraping that requires no human to define the extraction rules. The human defines the target (a URL and a list of fields to extract); the AI writes the config; the engine executes it.

That is the system this book builds.

Apply This

1. Start with a data audit. Before writing a single line of scraping code, document exactly what fields you need from each page, what makes each record unique (the natural key), and how often the data changes. This audit becomes your config specification.

2. Distinguish crawling from scraping in your mental model. Crawling builds the URL list; scraping extracts data from each URL. Design them as separate subsystems. The crawler output is the scraper input.

3. Check the API landscape first. Many data sources have partial API coverage. Scraping what the API does not cover is more sustainable than scraping everything. Pitfall: APIs get deprecated; build scraper fallbacks even when an API exists.

4. Respect the source. Use appropriate rate limits, identify your bot in the User-Agent, and schedule scrapes during off-peak hours. Aggressive scrapers get blocked; polite ones run for years.

5. Think in fields, not pages. The output of a scraper is a collection of structured records. Define the record schema first. Every scraping decision follows from “what do we need to populate this field?”
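Items 1 and 5 can be sketched together: a record schema defined up front, with a natural key that decides whether an incoming record is new or an update. The JobPosting fields, the example-board source name, and the in-memory store are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobPosting:
    # Hypothetical schema for a job-board scraper; field names are illustrative.
    source: str
    posting_id: str
    title: str
    company: str
    salary_text: str = ""

    @property
    def natural_key(self):
        """What makes a record unique: the source plus that source's own ID."""
        return (self.source, self.posting_id)


def upsert(store, record):
    """Insert or replace by natural key; returns True if the record is new."""
    is_new = record.natural_key not in store
    store[record.natural_key] = record
    return is_new


store = {}
a = upsert(store, JobPosting("example-board", "123", "Data Engineer", "Acme"))
b = upsert(store, JobPosting("example-board", "123", "Senior Data Engineer", "Acme"))
print(a, b, len(store))  # True False 1
```

With the schema fixed first, each field becomes a question to answer about the page, and the natural key keeps repeated scrapes from piling up duplicates.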