Scraping and Crawling

Scraping a Single Page

Once you have an API key, use any of our SDK libraries or the API directly. We recommend testing with a variety of URLs and assessing the responses you get back. The following are good settings to try first:

Single Page Scrape Using API in Python

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "request": "smart",  # Automatically decides which mode to use
    "return_format": "markdown",  # LLM-friendly format
    "proxy_enabled": True
}

response = requests.post(
    'https://api.spider.cloud/scrape',  # set to the scrape endpoint
    headers=headers,
    json=params
)

print(response.json())
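To follow the testing recommendation above, a small loop over a handful of URLs is enough to compare responses before scaling up. This is an illustrative sketch that reuses the headers and params from the example; the URL list is hypothetical:

test_urls = [
    "https://www.example.com",
    "https://www.example.org",  # hypothetical second test URL
]

for url in test_urls:
    params["url"] = url  # reuse the same settings across test URLs
    response = requests.post(
        'https://api.spider.cloud/scrape',
        headers=headers,
        json=params,
    )
    print(url, response.status_code)  # inspect each response before scaling up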

Crawling

To crawl from a starting URL, use the crawl endpoint, or the crawl_url method in the SDK library. Always set a maximum number of pages to crawl. When testing against a website that may have thousands of links, it's useful to set a reasonable cap (e.g. a limit of 30 pages) before crawling more.

Crawling Pages Using API in Python

import requests
import os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,  # Maximum number of pages to crawl
    "depth": 3,  # Reasonable depth for small sites
    "request": "smart",
    "return_format": "markdown"
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

print(response.json())
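After a test crawl, it is worth checking how many pages actually came back against the limit you set. A minimal sketch, assuming the endpoint returns a JSON array of page objects as in the example response below:

data = response.json()  # assumed to be a list of page objects
print(f"Crawled {len(data)} pages (limit was {params['limit']})")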

Example Response

Example Response from Crawling and Scraping

import requests
import os

# ... truncated code

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params
)

for result in response.json():  # iterate over the parsed JSON, not the raw response object
    print(result['content'])  # Main HTML content or text
    print(result['url'])  # URL of the page
    print(result['status'])  # HTTP status code
    print(result['error'])  # Error message if available
    print(result['costs'])  # Cost breakdown in USD for the page

Request Types

  • smart (default): Automatically determines whether to use http or chrome based on heuristics, switching to chrome when JavaScript is needed to render the page.
  • http: Performs a basic HTTP request. This is the fastest and most cost-efficient option, ideal for pages with static content or simple HTML responses.
  • chrome: Uses a headless Chrome browser to fetch the page. Use it for pages that require JavaScript rendering or on-page interactions; it is slower than the http and smart modes. To pin a mode explicitly, see the sketch after this list.
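Rather than relying on smart, you can force a specific mode by setting the request parameter. A minimal sketch, reusing the headers from the examples above:

# Force a plain HTTP fetch for a static page (fastest, cheapest).
params_static = {
    "url": "https://www.example.com",
    "request": "http",
    "return_format": "markdown",
}

# Force headless Chrome for a JavaScript-heavy page.
params_dynamic = {
    "url": "https://www.example.com",
    "request": "chrome",
    "return_format": "markdown",
}

response = requests.post(
    'https://api.spider.cloud/scrape',
    headers=headers,
    json=params_static,
)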

Streaming Responses

Use streaming to take full advantage of the crawling speed and process results as they are finished, as sketched below. Check out the docs on streaming for the exact wire format.
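As an illustrative sketch only: the example below assumes the crawl endpoint emits one JSON object per finished page on its own line (newline-delimited JSON) when the body is read incrementally; confirm the actual format in the streaming docs.

import requests
import os
import json

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

params = {
    "url": "https://www.example.com",
    "limit": 30,
    "return_format": "markdown",
}

with requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json=params,
    stream=True,  # let requests yield the body incrementally
) as response:
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        # Assumption: each non-empty line is a complete JSON page object.
        result = json.loads(line)
        print(result['url'], result['status'])  # process each page as it finishes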