Concurrent Streaming
Process crawl results in real time instead of waiting for the entire job to finish. Streaming is the recommended approach for any crawl over a few dozen pages — it reduces memory usage, avoids HTTP timeouts, and gives you sub-second time-to-first-result.
How Streaming Works
Set the Content-Type header to application/jsonl and enable streaming in your HTTP client. Spider sends each page as a newline-delimited JSON object the moment it finishes processing. The crawler still runs at full concurrency behind the scenes — pages arrive as fast as they are crawled, not in any fixed order.
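Each streamed line is a complete, self-contained JSON object. The exact fields depend on the options you set; the examples on this page read url, status, and content, so a line might look roughly like this (values are illustrative, not a guaranteed schema):
{"url": "https://www.example.com/pricing", "status": 200, "content": "# Pricing ..."}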
When to Use Streaming
Streaming is particularly valuable in three scenarios:
- Large crawls: Any crawl with a limit over 50 pages. Without streaming, the response buffer grows until the entire crawl completes, which can hit HTTP timeouts or memory limits.
- Real-time pipelines: When you need to process, embed, or store data as it arrives — for example, feeding a RAG pipeline or vector database during the crawl.
- Progress tracking: Each streamed line tells you which page was crawled and its status, so you can show progress to users or log metrics incrementally (see the sketch after this list).
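For incremental processing, a small callback that counts pages and hands finished content to a downstream step is usually all you need. The following is a minimal sketch assuming the page shape used throughout this page (url, status, content); index_document is a hypothetical placeholder for your own embedding or storage step, not part of the Spider API.
Python — Progress Tracking (sketch)
pages_seen = 0

def index_document(url: str, content: str) -> None:
    # Hypothetical downstream step: swap in your embedding / vector-store write.
    pass

def on_page(page: dict) -> None:
    global pages_seen
    pages_seen += 1
    url = page.get("url", "unknown")
    status = page.get("status", 0)
    print(f"[{pages_seen}] {url} -> {status}")
    if status == 200 and page.get("content"):
        index_document(url, page["content"])
Call on_page for each parsed line in the API example below, or pass it as the callback in the SDK example.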
Python — Streaming with the API
Use the requests library with stream=True and iterate over lines as they arrive.
Python — Streaming Responses
import requests, json, os

def process_page(page: dict):
    url = page.get("url", "unknown")
    status = page.get("status", 0)
    content = page.get("content", "")
    print(f"Crawled: {url} ({status}) — {len(content)} chars")

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://www.example.com",
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,
    timeout=120
)
response.raise_for_status()

# Each line is one crawled page; process it as soon as it arrives.
for line in response.iter_lines(decode_unicode=True):
    if line:
        page = json.loads(line)
        process_page(page)
Python — Streaming with the SDK
The Python SDK has built-in streaming support via the stream and callback parameters.
Python SDK — Streaming
from spider import Spider

app = Spider()

def handle_page(page: dict) -> None:
    print(f"Crawled: {page['url']} ({page['status']})")

# With stream=True and a callback, each page is handed to handle_page as it arrives.
result = app.crawl_url(
    "https://www.example.com",
    params={
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,
    callback=handle_page,
)
Node.js — Streaming
Use the Fetch API with a ReadableStream to process each line as it arrives.
Node.js — Streaming Responses
const response = await fetch('https://api.spider.cloud/crawl', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.SPIDER_API_KEY}`,
    'Content-Type': 'application/jsonl',
  },
  body: JSON.stringify({
    url: 'https://www.example.com',
    limit: 100,
    depth: 3,
    request: 'smart',
    return_format: 'markdown',
  }),
});

if (!response.ok) {
  throw new Error(`Crawl request failed: ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // Keep incomplete line in buffer

  for (const line of lines) {
    if (line.trim()) {
      const page = JSON.parse(line);
      console.log(`Crawled: ${page.url} (${page.status})`);
    }
  }
}

// Handle any final page left in the buffer (no trailing newline).
if (buffer.trim()) {
  const page = JSON.parse(buffer);
  console.log(`Crawled: ${page.url} (${page.status})`);
}