Concurrent Streaming
Process crawl results in real time instead of waiting for the entire job to finish. Streaming is the recommended approach for any crawl over a few dozen pages. It reduces memory usage, avoids HTTP timeouts, and gives you sub-second time-to-first-result.
How Streaming Works
Set the `Content-Type` header to `application/jsonl` and enable streaming in your HTTP client. Spider sends each page as a newline-delimited JSON object the moment it finishes processing. The crawler still runs at full concurrency behind the scenes. Pages arrive as fast as they are crawled, not in any fixed order.
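For illustration, a streamed response body is just one JSON object per line. With `return_format` set to markdown, the first few lines of a crawl might look like the sample below (fields abbreviated; the exact payload shape depends on your request parameters):

```jsonl
{"url": "https://www.example.com", "status": 200, "content": "# Example Domain\n..."}
{"url": "https://www.example.com/about", "status": 200, "content": "# About Us\n..."}
{"url": "https://www.example.com/old-page", "status": 404, "content": ""}
```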
When to Use Streaming
Streaming is particularly valuable in three scenarios:
- Large crawls: Any crawl with `limit` over 50 pages. Without streaming, the response buffer grows until the entire crawl completes, which can hit HTTP timeouts or memory limits.
- Real-time pipelines: When you need to process, embed, or store data as it arrives, for example feeding a RAG pipeline or vector database during the crawl.
- Progress tracking: Each streamed line tells you which page was crawled and its status, so you can show progress to users or log metrics incrementally (see the sketch after this list).
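As a sketch of the progress-tracking case referenced above, a small closure can wrap any per-page handler with a counter. `make_progress_tracker` is a hypothetical helper, not part of the Spider SDK:

```python
def make_progress_tracker(limit: int):
    """Return a per-page handler that logs incremental crawl progress."""
    crawled = 0

    def on_page(page: dict) -> None:
        nonlocal crawled
        crawled += 1
        print(f"[{crawled}/{limit}] {page.get('url')} ({page.get('status')})")

    return on_page

# Pass the returned handler wherever a per-page callback is expected,
# e.g. in place of process_page below or as callback= in the SDK example.
on_page = make_progress_tracker(limit=100)
```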
Python: Streaming with the API
Use the `requests` library with `stream=True` and iterate over lines as they arrive.
Python: Streaming Responses
```python
import requests, json, os

def process_page(page: dict):
    """Handle a single crawled page as soon as it arrives."""
    url = page.get("url", "unknown")
    status = page.get("status", 0)
    content = page.get("content", "")
    print(f"Crawled: {url} ({status}) - {len(content)} chars")

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    # application/jsonl switches the response to newline-delimited JSON.
    'Content-Type': 'application/jsonl',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://www.example.com",
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,  # hand back data as it arrives instead of buffering
    timeout=120
)
response.raise_for_status()

# Each non-empty line is one crawled page, emitted as soon as it finishes.
for line in response.iter_lines(decode_unicode=True):
    if line:
        page = json.loads(line)
        process_page(page)
```
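For long-running crawls, one bad line shouldn't abort the whole stream. Below is a hedged variant of the read loop above (it reuses the same `response` and `process_page`; logging and skipping malformed lines is an assumption about your error policy, not behavior prescribed by Spider):

```python
for line in response.iter_lines(decode_unicode=True):
    if not line:
        continue
    try:
        page = json.loads(line)
    except json.JSONDecodeError:
        # Assumed policy: log and skip lines that fail to parse.
        print(f"Skipping malformed line: {line[:80]!r}")
        continue
    process_page(page)
```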
Python: Streaming with the SDK
The Python SDK has built-in streaming support via the `stream` and `callback` parameters.
Python SDK: Streaming
```python
from spider import Spider

# Reads SPIDER_API_KEY from the environment by default.
app = Spider()

def handle_page(page: dict) -> None:
    print(f"Crawled: {page['url']} ({page['status']})")

result = app.crawl_url(
    "https://www.example.com",
    params={
        "limit": 100,
        "depth": 3,
        "request": "smart",
        "return_format": "markdown"
    },
    stream=True,
    callback=handle_page,
)
```
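The callback runs once per page while the crawl is still in flight, which makes it a convenient hand-off point to downstream code. A minimal sketch that batches pages for later storage (it reuses `app` from above; the in-memory list and the batch size of 25 are illustrative assumptions):

```python
pages: list[dict] = []

def collect_page(page: dict) -> None:
    pages.append(page)
    # Illustrative: report every 25 pages; swap in a real flush to
    # your database or vector store here.
    if len(pages) % 25 == 0:
        print(f"Collected {len(pages)} pages so far")

app.crawl_url(
    "https://www.example.com",
    params={"limit": 100, "return_format": "markdown"},
    stream=True,
    callback=collect_page,
)
```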
Node.js: Streaming
Use the Fetch API with a `ReadableStream` to process each line as it arrives.
Node.js: Streaming Responses
```javascript
const response = await fetch('https://api.spider.cloud/crawl', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.SPIDER_API_KEY}`,
    // application/jsonl switches the response to newline-delimited JSON.
    'Content-Type': 'application/jsonl',
  },
  body: JSON.stringify({
    url: 'https://www.example.com',
    limit: 100,
    depth: 3,
    request: 'smart',
    return_format: 'markdown',
  }),
});

if (!response.ok) {
  throw new Error(`Crawl request failed: ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

const handleLine = (line) => {
  if (!line.trim()) return;
  const page = JSON.parse(line);
  console.log(`Crawled: ${page.url} (${page.status})`);
};

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // Keep the incomplete trailing line in the buffer
  for (const line of lines) {
    handleLine(line);
  }
}

// Flush any final line that arrived without a trailing newline.
handleLine(buffer);
```