Concurrent Streaming

Process crawl results in real time instead of waiting for the entire job to finish. Streaming is the recommended approach for any crawl over a few dozen pages. It reduces memory usage, avoids HTTP timeouts, and gives you sub-second time-to-first-result.

How Streaming Works

Set the Content-Type header to application/jsonl and enable streaming in your HTTP client. Spider sends each page as a newline-delimited JSON object the moment it finishes processing. The crawler still runs at full concurrency behind the scenes. Pages arrive as fast as they are crawled, not in any fixed order.
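The JSONL framing described above can be illustrated with a short sketch: each line of the response is a complete JSON document, so a consumer can parse one record at a time without buffering the whole body. (The field names and URLs here are illustrative, mirroring the examples later in this page.)

```python
import json

# A JSONL stream is one JSON object per line; each line parses on its
# own, so results can be consumed the moment they arrive.
jsonl_stream = (
    '{"url": "https://www.example.com/", "status": 200}\n'
    '{"url": "https://www.example.com/about", "status": 200}\n'
)

pages = [json.loads(line) for line in jsonl_stream.splitlines() if line]
for page in pages:
    print(page["url"], page["status"])
```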

When to Use Streaming

Streaming is particularly valuable in three scenarios:

  • Large crawls: Any crawl with a limit above roughly 50 pages. Without streaming, the response buffer grows until the entire crawl completes, which can hit HTTP timeouts or memory limits.
  • Real-time pipelines: When you need to process, embed, or store data as it arrives, for example feeding a RAG pipeline or vector database during the crawl.
  • Progress tracking: Each streamed line tells you which page was crawled and its status, so you can show progress to users or log metrics incrementally.
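As a sketch of the progress-tracking case, a small counter can be updated once per streamed line. The `CrawlProgress` helper below is hypothetical, not part of the Spider SDK; the `url` and `status` fields mirror the examples that follow.

```python
import json

class CrawlProgress:
    """Illustrative helper: count streamed results against the crawl limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.done = 0
        self.errors = 0

    def update(self, line: str) -> dict:
        # Each streamed line is one crawled page.
        page = json.loads(line)
        self.done += 1
        if page.get("status", 0) >= 400:
            self.errors += 1
        print(f"[{self.done}/{self.limit}] {page.get('url')} ({page.get('status')})")
        return page

progress = CrawlProgress(limit=100)
progress.update('{"url": "https://www.example.com/", "status": 200}')
progress.update('{"url": "https://www.example.com/missing", "status": 404}')
```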

Python: Streaming with the API

Use the requests library with stream=True and iterate over lines as they arrive.

Python: Streaming Responses

import requests, json, os

def process_page(page: dict):
    url = page.get("url", "unknown")
    status = page.get("status", 0)
    content = page.get("content", "")
    print(f"Crawled: {url} ({status}) - {len(content)} chars")

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://www.example.com",
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,
    timeout=120
)
response.raise_for_status()

for line in response.iter_lines(decode_unicode=True):
    if line:
        page = json.loads(line)
        process_page(page)
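If the connection drops mid-stream, the final line may be truncated and `json.loads` will raise. A defensive variant of the parsing loop above (illustrative, same per-page contract) skips lines that are not valid JSON instead of aborting:

```python
import json

def iter_pages(lines):
    """Yield parsed page objects from a JSONL stream, skipping lines that
    are empty or not valid JSON (e.g. a partial final line after an
    interrupted connection)."""
    for line in lines:
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # drop truncated/malformed lines rather than crash

# A simulated stream: one good record, a blank line, and a truncated line.
sample = [
    '{"url": "https://www.example.com/", "status": 200}',
    '',
    '{"url": "https://www.exa',
]
pages = list(iter_pages(sample))
```

In the real loop you would write `for page in iter_pages(response.iter_lines(decode_unicode=True)): process_page(page)`.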

Python: Streaming with the SDK

The Python SDK has built-in streaming support via the stream and callback parameters.

Python SDK: Streaming

from spider import Spider

app = Spider()

def handle_page(page: dict) -> None:
    print(f"Crawled: {page['url']} ({page['status']})")

result = app.crawl_url(
    "https://www.example.com",
    params={
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,
    callback=handle_page,
)

Node.js: Streaming

Use the Fetch API with a ReadableStream to process each line as it arrives.

Node.js: Streaming Responses

const response = await fetch('https://api.spider.cloud/crawl', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.SPIDER_API_KEY}`,
    'Content-Type': 'application/jsonl',
  },
  body: JSON.stringify({
    url: 'https://www.example.com',
    limit: 100,
    depth: 3,
    request: 'smart',
    return_format: 'markdown',
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // Keep incomplete line in buffer
  for (const line of lines) {
    if (line.trim()) {
      const page = JSON.parse(line);
      console.log(`Crawled: ${page.url} (${page.status})`);
    }
  }
}