
Spider Platform

A practical walkthrough for collecting web data with Spider, from your first crawl to production pipelines.

Jeff Mendez

Getting started collecting data with Spider

Spider is a web crawling and scraping platform built from the ground up in Rust for speed and reliability. It handles proxy rotation, JavaScript rendering, rate limiting, and anti-bot detection so you can focus on what to do with the data.

This guide covers the fundamentals: setting up your account, running your first crawl, and configuring Spider to handle real-world scraping workloads.

What Spider Does

Spider provides a single API that turns any URL into structured data. Here is what that looks like in practice:

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Crawl a site and get LLM-ready markdown
response = requests.post('https://api.spider.cloud/crawl',
  headers=headers,
  json={
    "url": "https://example.com",
    "limit": 10,
    "return_format": "markdown",
    "request": "smart"
  }
)

for page in response.json():
    print(f"{page['url']}{len(page['content'])} chars")

Under the hood, Spider handles:

  • Concurrent crawling: thousands of pages per minute using a Rust-based engine
  • JavaScript rendering: headless Chrome for SPAs and dynamic content
  • Proxy rotation: datacenter, residential, and mobile proxies with automatic failover
  • Output formatting: HTML, markdown, plain text, screenshots, or structured JSON
  • Streaming: process results as they arrive instead of waiting for the full crawl
  • AI extraction: pull structured fields from pages using built-in LLM integration

Getting Set Up

Using the Dashboard

  1. Register or sign in with email or GitHub.
  2. Purchase credits to start crawling. Credits work on a pay-as-you-go model.
  3. Navigate to the dashboard and enter a URL to crawl.
  4. Export the results as CSV, JSON, or download directly.

The dashboard is the fastest way to test a URL and see what Spider returns before writing any code.

Using the API

For production workloads, the API gives you full control over crawl parameters, output format, and delivery.

  1. Create an API key from your account.
  2. Store it as an environment variable:
export SPIDER_API_KEY="your_key_here"
  3. Make your first request:
curl 'https://api.spider.cloud/scrape' \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "return_format": "markdown"}'

The API reference documents every endpoint and parameter. Client libraries are available for Python, JavaScript, and Rust.
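
The equivalent request in Python with requests looks like this. It is a minimal sketch that mirrors the curl call above (same /scrape endpoint and parameters), not an additional API surface:

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

# Scrape a single page and return it as markdown
response = requests.post('https://api.spider.cloud/scrape',
  headers=headers,
  json={"url": "https://example.com", "return_format": "markdown"}
)
print(response.json())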

Crawl Configuration

Adjusting a few settings before you crawl can save you credits and improve the quality of your results.

Request Modes

The request parameter controls how Spider fetches each page:

  Mode               When to use
  smart (default)    Automatically picks HTTP or Chrome based on page requirements
  http               Static pages, sitemaps, APIs. Fastest and cheapest
  chrome             SPAs, JS-rendered content, pages behind Cloudflare or similar protections
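
If you already know a target is fully static, you can force http mode and skip browser rendering entirely. A minimal sketch of the request parameters (the URL is a placeholder):

# Static docs site: plain HTTP is the fastest and cheapest mode
params = {
    "url": "https://docs.example.com",
    "limit": 10,
    "return_format": "markdown",
    "request": "http"
}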

Proxies

Enable proxy_enabled: true to route requests through Spider’s proxy network. This significantly reduces blocks on sites with anti-bot protections. For tougher targets, specify proxy_type as residential or mobile.

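For example, a crawl routed through residential proxies might use parameters like these (a sketch; the URL is a placeholder and the option values are the ones described above):

# Residential proxies for a block-prone target
params = {
    "url": "https://example.com",
    "limit": 25,
    "return_format": "markdown",
    "proxy_enabled": True,
    "proxy_type": "residential"
}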

Headless Browser

Set request: "chrome" to render pages in a real Chrome browser. This is required for single-page applications and sites that load content dynamically with JavaScript.

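A sketch of a Chrome-rendered crawl of a JavaScript-heavy app (the URL is a placeholder):

# Render each page in headless Chrome before extracting content
params = {
    "url": "https://app.example.com",
    "limit": 20,
    "return_format": "markdown",
    "request": "chrome"
}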

Crawl Budget Limits

Set limit to cap the number of pages Spider will crawl from a starting URL. This is critical for controlling costs on large sites.

params = {
    "url": "https://docs.example.com",
    "limit": 50,       # Stop after 50 pages
    "depth": 2,        # Only follow links 2 levels deep
    "return_format": "markdown"
}
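
Sending the budgeted crawl is the same POST as before. This sketch reuses the headers and params defined above and assumes, as in the earlier examples, that the response body is a JSON array of pages:

response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=params)
pages = response.json()
print(f"Crawled {len(pages)} pages (budget was {params['limit']})")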

You can also configure budgets per domain in your account settings using wildcard patterns, for example limiting all routes under a domain to a maximum of 50 pages.


Transforming Data

The return_format parameter controls what Spider gives you back:

  Format     Output           Best for
  raw        Original HTML    Parsing with your own tools
  markdown   Clean markdown   LLM ingestion, RAG pipelines
  text       Plain text       Search indexing, NLP tasks
  bytes      Raw bytes        Binary content, downloads

For AI and LLM workflows, markdown strips out navigation, ads, and boilerplate, giving you just the page content. This pairs well with streaming for real-time ingestion into vector databases.

# Get markdown for a RAG pipeline
response = requests.post('https://api.spider.cloud/crawl',
  headers=headers,
  json={
    "url": "https://docs.example.com",
    "limit": 100,
    "return_format": "markdown",
    "request": "smart"
  }
)

for page in response.json():
    # Each page is clean markdown ready for chunking
    chunks = split_into_chunks(page['content'])
    embed_and_store(chunks, metadata={"url": page['url']})

Streaming Large Crawls

For crawls over a few dozen pages, use streaming to process results as they arrive. Set the Content-Type header to application/jsonl and enable stream=True in your HTTP client:

import requests, json, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/jsonl',
}

response = requests.post('https://api.spider.cloud/crawl',
  headers=headers,
  json={
    "url": "https://example.com",
    "limit": 200,
    "return_format": "markdown",
    "request": "smart"
  },
  stream=True
)

with response as r:
    r.raise_for_status()
    for line in r.iter_lines(decode_unicode=True):
        if not line:
            continue  # skip blank keep-alive lines in the JSONL stream
        page = json.loads(line)
        print(f"Crawled: {page['url']} ({page['status']})")

Streaming reduces memory usage, gives you faster time-to-first-result, and avoids HTTP timeouts on long crawls. See the streaming docs for more details.

Open Source

The core crawling engine is fully open source at github.com/spider-rs/spider under the MIT license. Spider Cloud adds managed infrastructure, proxies, and the API layer on top. If you want to self-host or contribute, the open source project is the place to start.

Credits and Pricing

Spider uses a credit-based system where $1 = 10,000 credits. Credits are deducted per page based on the features used (proxies, Chrome rendering, AI extraction, etc.). You can track your usage on the usage page.
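
As a rough back-of-the-envelope, $1 = 10,000 credits converts a dollar budget directly into a credit pool. The per-page cost below is a made-up number for illustration only, since actual costs depend on which features each page uses:

# Hypothetical budget arithmetic -- credits_per_page is an assumption, not a real rate
CREDITS_PER_DOLLAR = 10_000           # $1 = 10,000 credits
budget_dollars = 5
credits_per_page = 30                 # assumed average; varies with proxies, Chrome, AI extraction

total_credits = budget_dollars * CREDITS_PER_DOLLAR
print(f"${budget_dollars} buys {total_credits:,} credits, roughly {total_credits // credits_per_page:,} pages")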

When you purchase credits, a subscription is created that allows pay-as-you-go usage when your balance runs out. The spending limit scales with your purchase history. A $5 purchase gives roughly $40 in spending capacity.

For more details, see the pricing page.
