Concurrent Streaming

Process crawl results in real time instead of waiting for the entire job to finish. Streaming is the recommended approach for any crawl over a few dozen pages — it reduces memory usage, avoids HTTP timeouts, and gives you sub-second time-to-first-result.

How Streaming Works

Set the Content-Type header to application/jsonl and enable streaming in your HTTP client. Spider sends each page as a JSON object on its own line (newline-delimited JSON) the moment it finishes processing. The crawler still runs at full concurrency behind the scenes — pages arrive as fast as they are crawled, not in any fixed order.
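For illustration, a streamed response body is one JSON object per line. The exact fields depend on your request parameters, but the examples in this section read url, status, and content; the values below are abridged and purely illustrative.

Example — Streamed JSONL Lines

{"url": "https://www.example.com/", "status": 200, "content": "# Example Domain ..."}
{"url": "https://www.example.com/about", "status": 200, "content": "# About ..."}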

When to Use Streaming

Streaming is particularly valuable in three scenarios:

  • Large crawls: Any crawl with limit over 50 pages. Without streaming, the response buffer grows until the entire crawl completes, which can hit HTTP timeouts or memory limits.
  • Real-time pipelines: When you need to process, embed, or store data as it arrives — for example, feeding a RAG pipeline or vector database during the crawl (see the sketch after this list).
  • Progress tracking: Each streamed line tells you which page was crawled and its status, so you can show progress to users or log metrics incrementally.
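The sketch below illustrates the real-time pipeline and progress-tracking scenarios using the Python SDK's stream and callback parameters, which are covered in detail later in this section. Writing raw pages to a local crawl_results.jsonl file is an illustrative placeholder; in a real RAG pipeline you would embed each page and upsert it into your vector database instead.

Python SDK — Incremental Processing (sketch)

import json

from spider import Spider

app = Spider()  # assumes the SPIDER_API_KEY environment variable is set
pages_seen = 0

def handle_page(page: dict) -> None:
    # Called once per page while the crawl is still running.
    global pages_seen
    pages_seen += 1
    # Persist each page immediately; swap this for an embedding +
    # vector-store upsert in a RAG pipeline.
    with open("crawl_results.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(page) + "\n")
    print(f"[{pages_seen}] {page.get('url')} ({page.get('status')})")

app.crawl_url(
    "https://www.example.com",
    params={"limit": 100, "return_format": "markdown"},
    stream=True,
    callback=handle_page,
)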

Python — Streaming with the API

Use the requests library with stream=True and iterate over lines as they arrive.

Python — Streaming Responses

import requests
import json
import os

def process_page(page: dict):
    url = page.get("url", "unknown")
    status = page.get("status", 0)
    content = page.get("content", "")
    print(f"Crawled: {url} ({status}) — {len(content)} chars")

# The application/jsonl content type tells Spider to stream results
# as newline-delimited JSON instead of one buffered response.
headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://www.example.com",
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,  # keep the connection open and read the body incrementally
    timeout=120
)
response.raise_for_status()

# Each non-empty line is one crawled page, available as soon as it finishes.
for line in response.iter_lines(decode_unicode=True):
    if line:
        page = json.loads(line)
        process_page(page)

Python — Streaming with the SDK

The Python SDK has built-in streaming support via the stream and callback parameters.

Python SDK — Streaming

from spider import Spider

app = Spider()

def handle_page(page: dict) -> None:
    print(f"Crawled: {page['url']} ({page['status']})")

# stream=True with a callback invokes handle_page for every page
# as soon as it is crawled, instead of returning one buffered list.
result = app.crawl_url(
    "https://www.example.com",
    params={
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,
    callback=handle_page,
)

Node.js — Streaming

Use the Fetch API with a ReadableStream to process each line as it arrives.

Node.js — Streaming Responses

const response = await fetch('https://api.spider.cloud/crawl', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.SPIDER_API_KEY}`,
    'Content-Type': 'application/jsonl',
  },
  body: JSON.stringify({
    url: 'https://www.example.com',
    limit: 100,
    depth: 3,
    request: 'smart',
    return_format: 'markdown',
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // Decode the chunk and split on newlines; each complete line is one page.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // Keep incomplete line in buffer

  for (const line of lines) {
    if (line.trim()) {
      const page = JSON.parse(line);
      console.log(`Crawled: ${page.url} (${page.status})`);
    }
  }
}