Chapter 11: Storage Patterns for Scraped Data

Scraped data has a shape that challenges conventional storage choices. Product records have variable schemas: a laptop has CPU and RAM specs; a headphone has driver size and impedance. Job listings vary across companies: some have salary ranges, some have only “competitive”. Review counts change daily. Prices change hourly.

The right storage approach depends on what you do with the data after scraping. This chapter covers the main patterns: SQLite for exploration, PostgreSQL for production relational data, and the dual-database pattern (relational + document store) for structured-plus-variable data.

The Scraped Record Shape

A scraped record has two parts:

Core fields: Title, URL, price, rating - consistent across all records of the same type. These map naturally to relational columns.

Variable fields: Specs, features, tags, custom attributes - vary per record. A MacBook Pro has "Processor", "RAM", "Storage". A desk lamp has "Wattage", "Bulb Type", "Cord Length".

Relational databases handle core fields well. Variable fields require either a JSON column, a key-value table, or a document store.

SQLite for Development and Small Datasets

SQLite requires no infrastructure and works perfectly for development, testing, and datasets under a few million records.

import sqlite3
import json
from datetime import datetime

def create_schema(db_path: str):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            slug        TEXT UNIQUE NOT NULL,
            title       TEXT,
            price       REAL,
            rating      REAL,
            category    TEXT,
            in_stock    INTEGER,
            tags        TEXT,  -- JSON array
            specs       TEXT,  -- JSON object
            url         TEXT,
            scraped_at  TEXT,
            raw         TEXT   -- Full JSON record
        )
    """)
    conn.commit()
    conn.close()

def upsert_product(db_path: str, product: dict):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        INSERT INTO products (slug, title, price, rating, category, in_stock,
                              tags, specs, url, scraped_at, raw)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(slug) DO UPDATE SET
            title = excluded.title,
            price = excluded.price,
            rating = excluded.rating,
            category = excluded.category,
            in_stock = excluded.in_stock,
            tags = excluded.tags,
            specs = excluded.specs,
            scraped_at = excluded.scraped_at,
            raw = excluded.raw
    """, (
        product.get("slug"),
        product.get("title"),
        product.get("price"),
        product.get("rating"),
        product.get("category"),
        product.get("in_stock"),
        json.dumps(product.get("tags", [])),
        json.dumps(product.get("specs", {})),
        product.get("url"),
        datetime.utcnow().isoformat(),
        json.dumps(product),
    ))
    conn.commit()
    conn.close()

The ON CONFLICT(slug) DO UPDATE clause is the upsert - if a product with that slug already exists, update it rather than failing. This makes scrape runs idempotent.

Store the full raw JSON record alongside the normalized columns. The raw record is your audit trail: if the extraction logic changes and you need to re-derive a field, the raw data is already stored.

PostgreSQL for Production

PostgreSQL handles the same patterns at production scale, with better concurrency, real foreign keys, and the JSONB column type for efficient querying of JSON data.

import asyncpg
import json
from datetime import datetime

async def upsert_product(pool: asyncpg.Pool, product: dict):
    await pool.execute("""
        INSERT INTO products (slug, title, price, rating, category, in_stock,
                              tags, specs, url, scraped_at, raw)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
        ON CONFLICT (slug) DO UPDATE SET
            title = EXCLUDED.title,
            price = EXCLUDED.price,
            rating = EXCLUDED.rating,
            category = EXCLUDED.category,
            in_stock = EXCLUDED.in_stock,
            tags = EXCLUDED.tags,
            specs = EXCLUDED.specs,
            scraped_at = EXCLUDED.scraped_at,
            raw = EXCLUDED.raw
    """,
        product.get("slug"),
        product.get("title"),
        product.get("price"),
        product.get("rating"),
        product.get("category"),
        product.get("in_stock"),
        json.dumps(product.get("tags", [])),  # JSONB
        json.dumps(product.get("specs", {})), # JSONB
        product.get("url"),
        datetime.utcnow(),
        json.dumps(product),
    )

PostgreSQL’s JSONB type allows querying inside JSON fields:

-- Find all products with RAM >= 16GB in their specs
SELECT title, price FROM products
WHERE specs->>'RAM' ILIKE '%16%'
   OR specs->>'RAM' ILIKE '%32%';

-- Find products with a specific tag
SELECT title FROM products
WHERE tags @> '["Laptops"]'::jsonb;

The Dual-Database Pattern

For high-volume scraping where the schema varies significantly across record types, a dual-database approach provides the best of both worlds:

Relational database (PostgreSQL or SQLite): Core structured fields - slug, title, price, rating, URL, scraped_at. These are consistent and queryable with SQL.

Document store (MongoDB or a JSON file store): The complete raw record with all variable fields. Query by slug to retrieve the full document.

# Write core fields to PostgreSQL
await pg_pool.execute(
    "INSERT INTO products (slug, title, price, rating, scraped_at) VALUES ($1,$2,$3,$4,$5)"
    " ON CONFLICT (slug) DO UPDATE SET title=$2, price=$3, rating=$4, scraped_at=$5",
    product["slug"], product["title"], product.get("price"),
    product.get("rating"), datetime.utcnow()
)

# Write full document to MongoDB
await mongo_collection.replace_one(
    {"slug": product["slug"]},
    {**product, "scraped_at": datetime.utcnow()},
    upsert=True
)

Use PostgreSQL when you need: aggregate queries (average price by category), filtering on multiple fields, joins across tables.

Use MongoDB when you need: the full variable-schema document, flexible querying of nested fields, rapid iteration when the schema is still evolving.

Change Tracking

Scraped data changes. Prices go up and down. Items go in and out of stock. Job listings expire. Tracking these changes over time requires more than upsert - it requires versioned records.

The simplest approach: a price_history table alongside the current products table.

CREATE TABLE products (
    slug        TEXT PRIMARY KEY,
    title       TEXT,
    price       REAL,
    in_stock    BOOLEAN,
    scraped_at  TIMESTAMPTZ
);

CREATE TABLE price_history (
    id         SERIAL PRIMARY KEY,
    slug       TEXT REFERENCES products(slug),
    price      REAL,
    recorded_at TIMESTAMPTZ DEFAULT NOW()
);

When upserting a product, check if the price changed before inserting a history record:

async def upsert_with_history(pool, product):
    async with pool.acquire() as conn:
        # Get current price
        row = await conn.fetchrow(
            "SELECT price FROM products WHERE slug = $1",
            product["slug"]
        )

        current_price = row["price"] if row else None
        new_price = product.get("price")

        # Upsert the main record
        await conn.execute("""
            INSERT INTO products (slug, title, price, in_stock, scraped_at)
            VALUES ($1, $2, $3, $4, NOW())
            ON CONFLICT (slug) DO UPDATE SET
                title = EXCLUDED.title,
                price = EXCLUDED.price,
                in_stock = EXCLUDED.in_stock,
                scraped_at = NOW()
        """, product["slug"], product["title"], new_price, product.get("in_stock"))

        # Record price change
        if current_price != new_price and new_price is not None:
            await conn.execute(
                "INSERT INTO price_history (slug, price) VALUES ($1, $2)",
                product["slug"], new_price
            )

File-Based Storage for Simple Cases

For development and small-scale extraction, JSON files are sometimes the right choice:

import json
from pathlib import Path
from datetime import datetime

def save_results(results: list[dict], output_dir: str, run_id: str):
    output_path = Path(output_dir) / f"{run_id}.json"
    output_path.write_text(json.dumps({
        "run_id": run_id,
        "scraped_at": datetime.utcnow().isoformat(),
        "count": len(results),
        "records": results,
    }, indent=2))
    return output_path

File-based storage is simple to inspect, easy to move between systems, and does not require database infrastructure. The downside: no indexing, no query capability, no deduplication. Use it for one-off extractions and prototyping.

Schema Evolution

Scraped schemas change. The target site adds a field you want. It removes a field you were extracting. The config changes to match.

For relational storage, schema evolution requires migrations: ALTER TABLE products ADD COLUMN discount_pct REAL. For MongoDB or JSON storage, schema evolution is free - new fields simply appear in new documents.

The pragmatic approach for long-lived scraping projects: use a relational table for the queryable core fields and evolve them with explicit migrations. Store the full raw document in a JSON column or document store so that new fields are always available for retrospective analysis, even before you added them to the schema.

Apply This

1. Always upsert, never insert. Use ON CONFLICT ... DO UPDATE in SQL or replace_one(upsert=True) in MongoDB. Idempotent writes mean you can re-run any scrape without worrying about duplicates.

2. Store the raw record alongside normalized fields. A raw JSONB or raw TEXT column containing the full extracted JSON is free insurance. When the extraction logic changes, you can re-derive normalized fields from the raw record without re-scraping.

3. Track changes that matter. Price history and stock status history are the most common valuable time series in product scraping. Implement change tracking from the start; retrofitting it to an existing dataset is painful.

4. Use SQLite for development. No infrastructure, no config, just a file. Swap in PostgreSQL for production by changing the connection string. The SQL dialect differences are minimal.

5. Index your primary lookup fields. slug (or whatever unique identifier you use for deduplication) must be indexed. Query fields like category and in_stock benefit from indexes once the dataset grows beyond a few hundred thousand records.