Chapter 8: Playwright for Dynamic Content
When a site loads its content through JavaScript rather than returning it in the initial HTML response, static fetching returns an empty shell. Playwright solves this by operating a real browser: it downloads the page, executes JavaScript, waits for the DOM to stabilize, and returns the fully rendered HTML.
This chapter covers Playwright’s integration with the scraping engine: launching browsers, navigating pages, waiting for content, and extracting data from the rendered DOM.
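To make the "empty shell" problem concrete, here is a minimal sketch using only the standard library. The two HTML strings are hypothetical stand-ins for what a static fetch and a rendered page might return, and `count_by_class` is an illustrative helper, not part of the scraping engine.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count elements whose class attribute contains a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_by_class(html: str, class_name: str) -> int:
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical responses: what a static fetch sees vs. what the browser renders
STATIC_SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
RENDERED = (
    '<html><body><div id="root">'
    '<article class="product-card">A</article>'
    '<article class="product-card featured">B</article>'
    "</div></body></html>"
)

print(count_by_class(STATIC_SHELL, "product-card"))  # 0
print(count_by_class(RENDERED, "product-card"))      # 2
```

A count of zero on the static response, where the browser shows products, is a strong hint that the page needs a rendering step.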
Why Playwright Over Selenium
Playwright is the modern choice for browser automation. Compared to Selenium:
- Async-native: Playwright’s Python API is designed for asyncio
- Auto-wait: Playwright waits for elements to be actionable (visible, stable, enabled) before interacting, which reduces flaky runs
- Network interception: Playwright can intercept and modify network requests
- Multiple browsers: Chromium, Firefox, and WebKit (Safari engine) with a single API
- Faster: Playwright talks to the browser over a persistent connection, which generally means less per-command overhead than the HTTP-based WebDriver protocol used by Selenium
For scraping, the key advantage is async support: async with async_playwright() integrates naturally with the async scraping engine.
Basic Playwright Fetch
The minimal pattern: launch a browser, navigate to a URL, wait for the page to load, and return the HTML.
import asyncio

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def fetch_playwright(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Block until the page has gone quiet on the network
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

async def main() -> None:
    # Use exactly like fetch_static
    html = await fetch_playwright("http://localhost:8002/products")
    soup = BeautifulSoup(html, "lxml")
    cards = soup.select("article.product-card")
    print(f"Found {len(cards)} products")  # 10 products, unlike with static fetch

asyncio.run(main())
wait_until="networkidle" tells Playwright to wait until there have been no network connections for at least 500 milliseconds. This is a common approach for CSR sites: wait until the JavaScript has finished fetching data and rendering the DOM.
The wait_until Options
Different pages need different wait strategies:
# "load": Wait for the load event (HTML parsed and stylesheets, scripts, and images loaded)
await page.goto(url, wait_until="load")
# "domcontentloaded": Wait for the initial HTML parse only (faster, fires before images finish)
await page.goto(url, wait_until="domcontentloaded")
# "networkidle": Wait until there have been no network connections for 500 ms (most thorough)
await page.goto(url, wait_until="networkidle")
# "commit": Return once the response is received and the document starts loading (rarely useful for scraping)
await page.goto(url, wait_until="commit")
For CSR sites that fetch data immediately on load, networkidle is the right choice. On sites with aggressive analytics, advertisements, or chat widgets that make network requests continuously, the network never goes idle and goto eventually times out. In that case, use load and add an explicit wait instead:
await page.goto(url, wait_until="load")
await page.wait_for_selector(".product-card", timeout=10000)  # Wait up to 10 s for the first card
Waiting for Specific Content
Instead of waiting for a time condition, wait for a specific element that indicates the content is ready:
async def fetch_playwright_with_probe(url: str, probe_selector: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        # Wait for the first element matching the probe (here, a product card)
        await page.wait_for_selector(probe_selector, timeout=15000)
        html = await page.content()
        await browser.close()
        return html

# Call from inside an async function
html = await fetch_playwright_with_probe(
    "http://localhost:8002/products",
    probe_selector=".product-card",
)
This is more reliable than networkidle because it checks directly for the content you need. If the selector does not appear within the timeout, Playwright raises its own TimeoutError (playwright.async_api.TimeoutError, not Python's built-in one), a clear signal that the fetch failed.
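One way to act on that signal is to catch the timeout and report the miss instead of letting it crash a whole run. The sketch below assumes an already-open Playwright `page`. `render_or_none` is a hypothetical helper name, and the `ImportError` fallback exists only so the snippet can be imported on a machine without Playwright.

```python
try:
    from playwright.async_api import TimeoutError as PlaywrightTimeoutError
except ImportError:
    # Assumption for this sketch only: lets the module import without Playwright installed
    PlaywrightTimeoutError = TimeoutError


async def render_or_none(page, url: str, probe_selector: str, timeout_ms: int = 15000):
    """Return the rendered HTML, or None if the probe selector never appears."""
    await page.goto(url, wait_until="load")
    try:
        await page.wait_for_selector(probe_selector, timeout=timeout_ms)
    except PlaywrightTimeoutError:
        # The page loaded, but the content we care about did not render in time
        return None
    return await page.content()
```

Returning None keeps the decision with the caller, which can retry, log the URL, or fall back to another fetch path.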
Reusing Browser Instances
Launching a new browser for every URL is expensive. A single Playwright browser can serve multiple pages through a browser context:
async def scrape_csr_site(urls: list[str]) -> list[str]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # One context for the whole run; every page below shares its cookies and storage
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
        )
        results = []
        for url in urls:
            page = await context.new_page()
            await page.goto(url, wait_until="networkidle")
            html = await page.content()
            await page.close()  # Free the tab before opening the next one
            results.append(html)
        await browser.close()
        return results
Each context.new_page() creates a new tab. Pages within the same context share cookies and storage but run independently. Closing each page after use keeps memory from growing without bound across large scraping runs.
Concurrency with Playwright
Multiple pages can run concurrently in the same browser context:
import asyncio

async def fetch_parallel(urls: list[str], max_concurrent: int = 3) -> list[str]:
    # The semaphore caps how many pages are open at the same time
    semaphore = asyncio.Semaphore(max_concurrent)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()

        async def fetch_one(url: str) -> str:
            async with semaphore:
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle")
                    html = await page.content()
                    return html
                finally:
                    # Close the page even if navigation fails
                    await page.close()

        results = await asyncio.gather(*[fetch_one(url) for url in urls])
        await browser.close()
        return results
More than 3-5 concurrent Playwright pages can strain memory and CPU, because each page typically runs in its own renderer process. Test against the resource limits of your target machine.
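The concurrency cap can be checked without a browser. The following sketch replaces the Playwright fetch with a stand-in coroutine and also collects per-URL failures instead of letting one exception abort the batch. `gather_bounded`, `fake_fetch`, and the URLs are hypothetical names used only for illustration.

```python
import asyncio


async def gather_bounded(fetch_one, urls: list[str], max_concurrent: int = 3):
    """Run fetch_one over urls with at most max_concurrent in flight.

    Returns (pages, errors): two dicts keyed by URL.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str):
        async with semaphore:
            return await fetch_one(url)

    # return_exceptions=True turns a raised exception into a result value
    outcomes = await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
    pages, errors = {}, {}
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, BaseException):
            errors[url] = outcome
        else:
            pages[url] = outcome
    return pages, errors


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(url: str) -> str:
        # Stand-in for a Playwright fetch; tracks how many calls overlap
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        if url.endswith("/bad"):
            raise RuntimeError("render failed")
        return f"<html>{url}</html>"

    urls = [f"http://localhost:8002/p/{i}" for i in range(8)] + ["http://localhost:8002/bad"]
    pages, errors = await gather_bounded(fake_fetch, urls, max_concurrent=3)
    return len(pages), len(errors), peak


print(asyncio.run(demo()))  # (8, 1, 3)
```

The peak of 3 confirms that the semaphore, not the number of URLs, decides how many fetches overlap.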
User Agent and Browser Fingerprint
Headless browsers have detectable characteristics. Sites with anti-bot measures check:
- The User-Agent string (headless Chromium reports itself)
- JavaScript properties like navigator.webdriver (set to true in automated browsers)
- Canvas and WebGL fingerprints
- Timing patterns
For most scraping targets, setting a realistic user agent is sufficient:
context = await browser.new_context(
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    viewport={"width": 1920, "height": 1080},
)
For sites with more sophisticated detection, community stealth tooling can hide some automation signals: for example, playwright-extra with its stealth plugin in the Node.js ecosystem, or the playwright-stealth package for Python. These are beyond the scope of this book.
Extracting Data Inside the Browser
Sometimes it is more efficient to extract data using JavaScript inside the browser rather than exporting the full HTML to Python:
# Execute JavaScript in the browser context and return the result
products = await page.evaluate("""
    () => {
        return Array.from(document.querySelectorAll('.product-card')).map(card => ({
            name: card.querySelector('h2.product-name')?.textContent.trim(),
            price: card.querySelector('.price-amount')?.textContent.trim(),
            url: card.querySelector('a.product-link')?.href,
        }));
    }
""")
# products is a Python list of dicts, converted from the JavaScript return value
print(products[0])
# {'name': 'MacBook Pro 14"', 'price': '$1,999.00', 'url': 'http://localhost:8002/products/macbook-pro-14'}
page.evaluate() runs the function in the page context and returns its result to Python. If the function returns a Promise, Playwright awaits it first. The return value must be serializable, so stick to JSON-compatible values such as strings, numbers, arrays, and plain objects.
This approach is faster than exporting full HTML when you need only a subset of data, and avoids the HTML parsing step entirely. The downside: the extraction logic is in JavaScript, which is less convenient than Python for complex transforms.
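A common split is to keep the in-browser JavaScript limited to pulling raw strings and do the transforms in Python afterwards. The sketch below assumes records shaped like the example output above. `parse_price` and `normalize` are hypothetical helpers, and the price format (a dollar sign with comma separators) is an assumption taken from that one example.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Convert a display price such as '$1,999.00' to a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


def normalize(products: list[dict]) -> list[dict]:
    """Drop incomplete records and convert price strings to Decimal."""
    cleaned = []
    for item in products:
        # A selector that matched nothing in the browser arrives as None or a missing key
        if not item.get("name") or not item.get("price"):
            continue
        cleaned.append({**item, "price": parse_price(item["price"])})
    return cleaned


raw = [
    {"name": 'MacBook Pro 14"', "price": "$1,999.00",
     "url": "http://localhost:8002/products/macbook-pro-14"},
    {"name": None, "price": None, "url": None},  # card with missing nodes
]
print(normalize(raw)[0]["price"])  # 1999.00
```

Decimal is used rather than float so that prices keep their exact cents when stored or compared.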
Intercepting API Calls
CSR sites fetch their data from APIs. Instead of scraping the rendered DOM, you can intercept the API calls directly:
async def intercept_api(url: str) -> list[dict]:
    api_responses = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Route handler: perform the real request, record the JSON, pass the response through unchanged
        async def handle_route(route):
            response = await route.fetch()
            body = await response.json()
            if "products" in body:
                api_responses.append(body)
            await route.fulfill(response=response)

        # Register the handler before navigating so the first API call is caught
        await page.route("**/api/products*", handle_route)
        await page.goto(url, wait_until="networkidle")
        await browser.close()
    # All captured API responses
    return api_responses
When the JavaScript on the page calls /api/products?page=1, the route handler captures the response. The data arrives as clean JSON, bypassing HTML parsing entirely.
This is often the most efficient approach for CSR sites: let the browser load the page (which triggers the API calls), capture the API responses, and skip HTML extraction entirely.
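Captured responses usually still need to be merged into one list of records. The sketch below assumes each captured body is a dict with a "products" list, as the handler's check suggests. The "id" field used for de-duplication is a hypothetical key; substitute whatever the real API exposes.

```python
def flatten_products(api_responses: list[dict], key: str = "id") -> list[dict]:
    """Merge captured API bodies into one product list, skipping repeats.

    `key` names the field that identifies a product (hypothetical default: "id").
    """
    seen = set()
    products = []
    for body in api_responses:
        items = body.get("products")
        if not isinstance(items, list):
            continue  # unexpected shape: ignore rather than crash
        for item in items:
            marker = item.get(key)
            if marker is not None and marker in seen:
                continue  # same product captured twice, e.g. a repeated request
            seen.add(marker)
            products.append(item)
    return products


captured = [
    {"products": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}], "page": 1},
    {"products": [{"id": 2, "name": "B"}, {"id": 3, "name": "C"}], "page": 2},
]
print([p["id"] for p in flatten_products(captured)])  # [1, 2, 3]
```

Records without the key are always kept, since there is nothing to compare them by.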
Playwright in the Config Engine
The config’s render_mode field controls which fetch path the engine uses:
{
    "render_mode": "static",
    ...
}
{
    "render_mode": "playwright",
    ...
}
The engine resolves render_mode to a fetch function:
from typing import Optional

async def fetch_page(url: str, render_mode: str, probe_selector: Optional[str] = None) -> str:
    if render_mode == "static":
        return await fetch_static(url)
    elif render_mode == "playwright":
        # fetch_playwright takes only a URL; use the probe variant when a selector is configured
        if probe_selector:
            return await fetch_playwright_with_probe(url, probe_selector)
        return await fetch_playwright(url)
    elif render_mode == "auto":
        return await fetch_auto(url, probe_selector)  # Chapter 9
    else:
        raise ValueError(f"Unknown render_mode: {render_mode}")
The rest of the engine (pagination, field extraction, storage) is identical regardless of render mode. The only difference is which function fetches the HTML. The same CSS selectors that work on ShopSphere SSR work on ShopSphere CSR after Playwright renders the page.
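An unknown render_mode is cheaper to catch when the config is loaded than at the first fetch. The sketch below validates the field up front. `read_render_settings`, the "probe_selector" config key, and the default of "static" are assumptions made for illustration, not the engine's documented behavior.

```python
import json
from typing import Optional

RENDER_MODES = {"static", "playwright", "auto"}


def read_render_settings(raw: str) -> tuple[str, Optional[str]]:
    """Parse a JSON config string and return (render_mode, probe_selector)."""
    config = json.loads(raw)
    # Assumption: a config without render_mode means a plain static fetch
    mode = config.get("render_mode", "static")
    if mode not in RENDER_MODES:
        raise ValueError(f"Unknown render_mode: {mode}")
    # Hypothetical key: the selector to wait for when rendering with a browser
    return mode, config.get("probe_selector")


print(read_render_settings('{"render_mode": "playwright", "probe_selector": ".product-card"}'))
# ('playwright', '.product-card')
```

The returned pair can be passed straight to a dispatcher such as fetch_page.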
Apply This
1. Install Playwright’s browsers before running. playwright install chromium downloads the Chromium binary that Playwright uses. Without this step, the browser launch fails.
2. Default to wait_until="networkidle" for CSR sites, but have a fallback. If networkidle never fires (due to background requests), use wait_until="load" combined with wait_for_selector(probe) to wait for specific content.
3. Close pages after use. Accumulating open pages in a long-running browser context leaks memory. Close each page when you are done with it.
4. Consider API interception before DOM scraping. If the site is CSR, find the API calls in the browser’s Network tab. Intercepting the API directly may be simpler and faster than scraping the rendered DOM.
5. Test both SSR and CSR configs with the same selectors. The demo sites are designed to verify this: configs/shopsphere-ssr.json and configs/shopsphere-csr.json use identical CSS selectors. Only render_mode differs. If your selectors work on the SSR site, they should work identically on the CSR site once Playwright has rendered it.