Scaling Headless Chrome
Headless Chrome runs browser automation without a GUI. It handles scraping JavaScript-rendered pages, running automated tests, and capturing screenshots. This guide covers how to scale it for high-throughput workloads using Rust-based tools, containers, and cloud infrastructure.
Libraries and Tools
headless-browser
The headless-browser library manages multiple browser instances across Chrome, Firefox, and others. For scraping, chrome-headless-shell gives the best performance since it skips the full browser UI entirely.
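If you drive it from Node, recent Puppeteer releases (v22+) can target chrome-headless-shell directly through the “shell” headless mode. A minimal sketch, assuming Puppeteer is installed and using a placeholder URL:

// Minimal sketch: Puppeteer's "shell" mode drives chrome-headless-shell
// instead of the full browser (install the binary first with:
// npx @puppeteer/browsers install chrome-headless-shell).
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: 'shell' });
const page = await browser.newPage();
await page.goto('https://example.com'); // placeholder URL
console.log(await page.title());
await browser.close();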
Spider’s CDP Handler (chromey)
Spider’s CDP handler is a Rust-based Chrome DevTools Protocol client, similar to Puppeteer but built for Rust’s concurrency model:
- Concurrent page handling: Process multiple pages simultaneously without degradation
- Built-in ad-blocking: Faster page loads and less bandwidth
- Tracker blocking: Reduces page weight and speeds up rendering
It optimizes CDP communication in Rust and provides the concurrency backbone for Spider’s crawling engine.
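chromey itself is a Rust crate, so its exact API is not shown here; the concurrency pattern it enables looks like this JavaScript sketch of concurrent page handling over a single browser connection (URLs are illustrative):

// Illustrative only: several pages rendered concurrently over one browser.
// In Rust with chromey, the same shape comes from spawning async tasks.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: 'shell' });
const urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];

// Each page gets its own CDP session, so rendering proceeds in parallel.
const titles = await Promise.all(
  urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    const title = await page.title();
    await page.close();
    return title;
  })
);

console.log(titles);
await browser.close();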
Containers
Docker lets you run isolated Chrome instances with controlled resource limits. This is the standard approach for scaling horizontally: spin up more containers as load increases.
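A minimal sketch, assuming the chromedp/headless-shell image (any headless Chrome image works the same way); the memory and CPU caps are illustrative, and --shm-size matters because Chrome quickly exhausts Docker's default 64 MB /dev/shm:

# Run one resource-capped Chrome instance; scale out by starting more containers.
docker run -d \
  --name chrome-1 \
  --memory=2g --cpus=1.5 \
  --shm-size=1g \
  -p 9222:9222 \
  chromedp/headless-shell:latest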
Error Handling and Stability
- Automatic restarts: Detect and restart unresponsive browser instances; Chrome processes can hang or leak memory under heavy load (see the watchdog sketch after this list)
- Session caching: Cache browser sessions and outputs to avoid redundant work on repeated requests
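As one approach to automatic restarts, a watchdog can poll each instance's /json/version endpoint and recycle containers that stop answering. A sketch, where the host, container name, and docker restart command are assumptions to swap for your own setup:

// Hypothetical watchdog: restart a Chrome container that stops responding.
import { exec } from 'node:child_process';

const checkInstance = async (host, container) => {
  try {
    // A healthy instance answers /json/version almost instantly.
    const res = await fetch(`http://${host}:9222/json/version`, {
      signal: AbortSignal.timeout(5000),
    });
    if (!res.ok) throw new Error(`status ${res.status}`);
  } catch {
    console.warn(`${container} unresponsive, restarting`);
    exec(`docker restart ${container}`); // swap for your orchestrator's restart call
  }
};

setInterval(() => checkInstance('127.0.0.1', 'chrome-1'), 30_000);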
Cloud Scaling with AWS Fargate
Fargate runs containerized Chrome instances without managing servers:
- Elastic scaling: Fargate tasks (for example, 2 vCPUs with 16 GB of memory) scale up and down based on demand; a matching task-definition sketch follows this list
- Pay-per-use: You only pay for the compute and memory your headless operations consume
- Shared volumes: Mount shared storage across containers to reduce redundant downloads
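A task-definition fragment for that sizing might look like the following; the family, image, and port are placeholders (on Fargate, cpu 2048 means 2 vCPUs and memory is in MiB):

{
  "family": "headless-chrome",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "2048",
  "memory": "16384",
  "containerDefinitions": [
    {
      "name": "chrome",
      "image": "chromedp/headless-shell:latest",
      "portMappings": [{ "containerPort": 9222 }]
    }
  ]
}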
Proxies and Caching
- Rotating proxies: Prevent IP blocks and reduce CAPTCHA triggers by routing requests through different IPs
- Output caching: Cache JSON outputs from Chrome sessions to speed up repeated requests (a combined sketch follows this list)
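A sketch combining both ideas; the proxy list and the in-memory cache are illustrative, and Chrome receives the proxy through its standard --proxy-server flag:

// Illustrative proxy rotation plus an in-memory output cache.
import puppeteer from 'puppeteer';

const proxies = ['http://proxy-a:8080', 'http://proxy-b:8080']; // placeholder proxies
let next = 0;
const cache = new Map(); // url -> rendered output

const fetchRendered = async (url) => {
  if (cache.has(url)) return cache.get(url); // cache hit: skip Chrome entirely

  const proxy = proxies[next++ % proxies.length]; // rotate per launch
  const browser = await puppeteer.launch({
    headless: 'shell',
    args: [`--proxy-server=${proxy}`],
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  const output = { url, title: await page.title() };
  await browser.close();

  cache.set(url, output);
  return output;
};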
Load Balancing with ALB
An Application Load Balancer (ALB) with Least Outstanding Requests routing distributes traffic effectively across Chrome instances. It works better than an NLB for headless Chrome because request processing times vary widely: some pages load in 100 ms, others take 10+ seconds.
Configuration
- Use ALB over NLB: An NLB distributes connections without regard to each target’s current workload. An ALB with “Least Outstanding Requests” sends new connections to the instance with the fewest active requests.
- Least Outstanding Requests: Routes to the target with the lowest in-progress request count. Note: this algorithm is incompatible with “Slow Start Duration”, so instances must handle traffic immediately on launch; it is enabled via a target-group attribute, as shown below.
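With the AWS CLI, the attribute can be set like this (the target-group ARN is a placeholder):

aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/chrome/abc123 \
  --attributes Key=load_balancing.algorithm.type,Value=least_outstanding_requests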
WebSocket Retry Logic
WebSocket connections to Chrome instances can fail during scaling events or container restarts. Use exponential backoff:
// url is usually the /json/version endpoint of the instance.
// Note: WebSocket is global in browsers and Node 22+; on older Node, import it from the 'ws' package.
const connectWebSocket = async (url) => {
  let attempts = 0;
  const maxAttempts = 5;
  const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  while (attempts < maxAttempts) {
    try {
      // Fetch the instance metadata to get the WebSocket debugger URL.
      const response = await fetch(url);
      const data = await response.json();
      const wsDebuggerUrl = data.webSocketDebuggerUrl;

      // Replace the raw WebSocket with any puppeteer, playwright, or chromey connect method.
      // Wrapping the handshake in a promise lets connection failures reach the catch block.
      const ws = await new Promise((resolve, reject) => {
        const socket = new WebSocket(wsDebuggerUrl);
        socket.onopen = () => resolve(socket);
        socket.onerror = () => reject(new Error('WebSocket handshake failed'));
      });

      console.log('WebSocket connection established');
      return ws; // exit if successful
    } catch (error) {
      attempts++;
      if (attempts < maxAttempts) {
        console.log(`Retry attempt ${attempts} for WebSocket connection`);
        await delay(1000 * 2 ** (attempts - 1)); // exponential backoff: 1s, 2s, 4s, 8s
      } else {
        console.error('Failed to establish WebSocket connection after several attempts');
        throw error;
      }
    }
  }
};
With ALB routing, your application never depends on a specific instance. Any container can handle any request, giving you redundancy and fault tolerance out of the box.