Concurrent Streaming
Accelerate your data collection with real-time streaming during concurrent crawls.
Streaming Responses
Streaming lets you process each page's data as soon as it is received, rather than waiting for the entire crawl to complete. This avoids long waits and reduces the risk of timeouts or data loss, so we recommend enabling streaming responses when crawling large websites or using a high page limit. The crawl runs at full concurrency while you consume and process pages as they arrive. The following shows how to stream responses using Python and the jsonlines library.
Streaming Responses in API
import requests, jsonlines, os
from typing import Dict

def process_item(item: Dict):
    # Handle a single crawled page as soon as it arrives.
    url = item.get("url", "unknown_url")
    status = item.get("status", 0)
    content = item.get("content", {})
    print(f"URL: {url}")
    print(f"Status: {status}")
    print(f"Content: {content}")

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',  # request JSONL (newline-delimited JSON) responses
}

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "raw",
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,  # set to True so the body is read incrementally
    timeout=60,
)
response.raise_for_status()

reader = jsonlines.Reader(response.raw)
for item in reader:
    try:
        process_item(item)
    except Exception as e:
        print(f"Error processing item: {e}")
        continue
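The jsonlines reader consumes response.raw as newline-delimited JSON. If you would rather not add the extra dependency, the same stream can be read with response.iter_lines() and the standard-library json module. The following is a minimal sketch, assuming each non-empty line of the response body is a single JSON object; the printed fields are illustrative.

import json, os, requests

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

params = {"url": "https://www.example.com", "limit": 30, "return_format": "raw"}

with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,
    timeout=60,
) as response:
    response.raise_for_status()
    # iter_lines yields each JSONL record as soon as it arrives
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        item = json.loads(line)
        print(item.get("url", "unknown_url"), item.get("status", 0))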
Streaming Responses in Python SDK
from spider import Spider

app = Spider()  # reads the SPIDER_API_KEY environment variable

def handle_json(json_obj: dict) -> None:
    # Called for each page as the SDK decodes it from the stream.
    assert json_obj["url"] is not None

url = "https://www.example.com"

params = {
    "limit": 30,
    "depth": 3,
    "request": "smart",
    "return_format": "markdown",
}

response = app.crawl_url(
    url,
    params=params,
    stream=True,  # set to True
    callback=handle_json,
)
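The callback receives each page as soon as the SDK decodes it from the stream, which makes it a natural place to persist results incrementally instead of holding the whole crawl in memory. Below is a minimal sketch of such a callback; the file name crawl_results.jsonl is an illustrative choice, and the call reuses the app, url, and params defined above.

import json

def save_item(json_obj: dict) -> None:
    # Append each streamed page to a local JSONL file as it arrives.
    with open("crawl_results.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(json_obj) + "\n")

app.crawl_url(url, params=params, stream=True, callback=save_item)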