POST /pipeline/*

AI Data Extraction

Deprecated

Go beyond raw content extraction. Spider's AI pipelines understand web pages semantically, pulling out contacts and leads, generating question-answer pairs, categorizing websites, and filtering links based on relevance. Crawl + AI in one integrated workflow.

Start Extracting Try in Playground

⚠

This endpoint is deprecated

AI Extraction pipelines are being replaced by more flexible alternatives. For structured data extraction, use the Fetch API with AI-discovered configs. For custom extraction, use Scrape with css_extraction_map.

Four AI Pipelines

Extract Contacts

POST /pipeline/extract-contacts

Crawl websites and use AI to identify and extract contact information including email addresses, phone numbers, social profiles, and business details. Results are stored and queryable via the contacts data API.

Email discovery Phone numbers Social profiles Company data

Questions & Answers

POST /pipeline/extract-qa

Crawl a website and generate structured Q&A pairs from its content. Provide an inquiry or topic and Spider produces relevant questions with answers grounded in the actual page content.

FAQ generation Topic-focused Training data Knowledge bases

Label Website

POST /pipeline/label

Crawl a website and have AI categorize it into topics, industries, or custom labels. Useful for building directories, classifying leads, or organizing large collections of URLs.

Auto-categorization Industry detection Custom labels Topic tagging

Filter Links

POST /pipeline/filter-links

Crawl a website's links and use AI to filter them based on relevance, content type, or custom criteria. Keep only the URLs that match your data collection goals, eliminating noise.

AI relevance scoring Content-type filtering Custom criteria Noise elimination

Bonus: Crawl from Text

POST /pipeline/crawl-text

Paste raw text or markdown containing URLs, and Spider will automatically extract every link and crawl them. Skip the step of parsing URLs yourself. Just send the document, email body, or notes and let Spider handle discovery. Supports up to 10 MB of input text.

Code Examples

Python cURL JavaScript

from spider import Spider

client = Spider()

# Extract contacts from a company website
contacts = client.extract_contacts(
    "https://example.com",
    params={
        "limit": 50,
    }
)

for contact in contacts:
    print(contact)

curl -X POST https://api.spider.cloud/pipeline/extract-qa \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 25,
    "return_format": "markdown"
  }'

import Spider from "@spider-cloud/spider-client";

const client = new Spider();

const result = await client.label(
  "https://example.com",
  { limit: 10 }
);

console.log(result);
// [{ url: "...", labels: ["Technology", "SaaS", "Developer Tools"] }]

Popular Use Cases

Sales Lead Generation — Crawl target company websites to extract emails, phone numbers, and team member details. Build prospect lists automatically instead of manual research.
Fine-Tuning Datasets — Generate Q&A pairs from documentation sites and knowledge bases to create training data for domain-specific language models and chatbots.
Website Directories — Label and categorize large collections of URLs for building topical directories, industry databases, or content recommendation systems.
Smart Link Discovery — Filter a website's links to find only product pages, blog posts, or documentation. Skip navigation, legal pages, and irrelevant content.

AI Data Extraction

This endpoint is deprecated

Four AI Pipelines

Extract Contacts

Questions & Answers

Label Website

Filter Links

Bonus: Crawl from Text

Code Examples

Popular Use Cases

Related Resources

Crawl API

Lead Generation Guide

AI Training Data

Extract intelligence from any website