
From Web Page to Vector Database: The Complete Pipeline

A deep technical walkthrough of the full data pipeline from raw URL to queryable vector store, covering crawling, extraction, chunking, embedding, and indexing with working code and cost analysis.

Jeff Mendez · 18 min read

Every RAG application, every semantic search engine, every AI agent that can “look things up” depends on the same fundamental plumbing: a pipeline that turns web pages into vectors and stores them somewhere queryable. The concept is simple. The implementation has a surprising number of places where things break.

This post walks through every stage of that pipeline, from raw URL to indexed embedding. For each stage, we cover what it does, what goes wrong in practice, and how to build it so it holds up at scale. All code examples are working Python that you can copy, adapt, and deploy.

The pipeline at a glance

The full pipeline has eight stages:

  1. URL discovery: finding the pages you need to crawl
  2. Crawling: fetching page content over HTTP or headless browser
  3. Content extraction: pulling meaningful content out of raw HTML
  4. Cleaning: removing boilerplate, navigation, ads, and artifacts
  5. Chunking: splitting content into pieces sized for embedding models
  6. Embedding: converting text chunks into dense vectors
  7. Indexing: writing vectors to a database with metadata
  8. Querying: retrieving relevant chunks at inference time

Each stage feeds the next. A failure or quality drop at any point propagates downstream. Bad HTML extraction produces noisy chunks, noisy chunks produce poor embeddings, and poor embeddings mean your retrieval misses the answer even when it was on the page you crawled.

The rest of this post goes stage by stage.

Stage 1: URL discovery

Before you crawl anything, you need a list of URLs. For a single domain, this usually means one of three approaches:

  • Sitemap parsing: fetch /sitemap.xml and extract all <loc> entries. Fast, but sitemaps are often incomplete or stale.
  • Recursive link following: start at the root URL and follow internal links. Thorough, but requires deduplication and cycle detection.
  • Hybrid: use the sitemap as a seed list, then discover additional pages through link following.

What goes wrong

Sitemaps frequently omit dynamic or recently published pages. Link following can explode into millions of URLs on large sites if you do not set depth and page limits. Duplicate URLs with different query parameters, trailing slashes, or fragment identifiers inflate your crawl without adding content.
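
A small normalization pass before queuing URLs removes most of that duplication. Here is a minimal sketch; the specific rules (dropping fragments, stripping common tracking parameters, trimming trailing slashes) are illustrative choices rather than a standard, and candidate_urls stands in for whatever list your discovery step produced.

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def normalize_url(url: str) -> str:
    """Normalize a URL so trivially different variants dedupe to a single key."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Drop tracking parameters and sort the rest so ordering differences do not matter
    params = sorted(
        (k, v)
        for k, v in parse_qsl(query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    path = path.rstrip("/") or "/"  # treat /docs and /docs/ as the same page
    return urlunsplit((scheme.lower(), netloc.lower(), path, urlencode(params), ""))

seen, unique_urls = set(), []
for url in candidate_urls:
    key = normalize_url(url)
    if key not in seen:
        seen.add(key)
        unique_urls.append(url)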

How Spider handles it

Spider’s /crawl endpoint accepts a root URL, follows internal links automatically, respects robots.txt, deduplicates URLs with query parameter normalization, and lets you cap the crawl with a limit parameter. You can also pass sitemap: true to seed from the sitemap first. The default smart mode inspects each page before deciding whether to use a lightweight HTTP fetch or a full headless Chrome render.

import requests
import os

# Discover and crawl up to 500 pages from a domain
response = requests.post(
    "https://api.spider.cloud/crawl",
    headers={
        "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://docs.example.com",
        "limit": 500,
        "return_format": "markdown",
        "request": "smart",
    },
)

pages = response.json()
print(f"Crawled {len(pages)} pages")

Stage 2: Crawling

Crawling is the act of making HTTP requests and getting page content back. For static sites, a plain GET request with the right headers is sufficient. For JavaScript-heavy SPAs, you need a headless browser that executes JavaScript before returning the DOM.

What goes wrong

Rate limiting and IP blocking are the most common failures. Sites behind Cloudflare, Akamai, or Imperva will serve CAPTCHAs or 403 responses to unrecognized clients. Headless browser detection (checking for navigator.webdriver, missing browser plugins, or suspicious viewport sizes) blocks naive Puppeteer and Playwright setups. Connection timeouts, TLS errors, and redirect loops are everyday occurrences at scale.

Retry logic matters. A naive retry-on-failure loop without backoff and jitter will only get your IP banned faster. You also need to handle partial failures gracefully: if 3 out of 500 pages fail, the pipeline should continue with the 497 that succeeded.
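
A minimal sketch of a retry loop that holds up better, exponential backoff with jitter and a hard cap on attempts; the retryable status codes and timeout values here are illustrative defaults.

import random
import time

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def fetch_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 1.0):
    """GET a URL, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code not in RETRYABLE_STATUS:
                return response  # success or a non-retryable error: stop retrying
        except requests.RequestException:
            pass  # timeouts, TLS errors, connection resets: retry
        # Backoff doubles each attempt (1s, 2s, 4s, ...) plus jitter to avoid thundering herds
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return None  # give up; the caller records a partial failure and moves on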

How Spider handles it

Spider’s crawl engine is written in Rust and handles retries, proxy rotation (datacenter, residential, and mobile IPs), anti-bot bypass, and browser fingerprint management internally. The smart mode inspects each page and picks the cheapest path: HTTP for static content, Chrome only when rendering is required. Success rate across production traffic sits around 99%.

For large crawls, switch the Content-Type header to application/jsonl (NDJSON) to stream results as they arrive instead of buffering the entire response in memory.
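
Consuming that stream with requests looks roughly like the sketch below, assuming each line of the streamed response is one JSON object per page; process() is a placeholder for whatever you do with each page downstream.

import json
import os

import requests

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers={
        "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
        "Content-Type": "application/jsonl",
    },
    json={"url": "https://docs.example.com", "limit": 500, "return_format": "markdown"},
    stream=True,  # do not buffer the whole crawl in memory
)

for line in response.iter_lines():
    if not line:
        continue
    page = json.loads(line)  # one page per NDJSON line
    process(page)            # placeholder: clean, chunk, embed, index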

Stage 3: Content extraction

Raw HTML contains the content you want buried inside navigation bars, headers, footers, cookie banners, sidebar widgets, ad scripts, and tracking pixels. Extracting the meaningful content is where most pipelines first lose quality.

Common approaches

  • CSS selector extraction: fragile, breaks when sites redesign, requires per-site maintenance.
  • Readability algorithms: Mozilla’s Readability (used in Firefox Reader View) works well for article-style pages but struggles with documentation, product pages, and forums.
  • LLM-based extraction: accurate but slow and expensive at scale.
  • Markdown conversion with boilerplate removal: converts the DOM to markdown while stripping non-content elements. This is the sweet spot for RAG pipelines.

What goes wrong

Raw HTML extraction keeps too much noise. <nav>, <footer>, <aside>, and <script> tags all end up in your chunks. CSS-based extraction requires constant maintenance. Readability-style algorithms make binary decisions about what is “content” and what is not, which works for blog posts but fails on pages with mixed content types.

How Spider handles it

When you set return_format: "markdown", Spider converts the page to clean markdown with navigation, ads, footers, and boilerplate already stripped. The conversion happens server-side in the Rust pipeline, so you receive content that is ready to chunk without a separate cleaning step.

For structured extraction, you can pass a natural language prompt in the extraction_prompt parameter and receive structured JSON. No CSS selectors, no XPath.
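
A hedged sketch of what that can look like, assuming extraction_prompt is passed alongside the same /crawl payload used earlier; the prompt and the fields it asks for are illustrative, and the shape of the returned JSON follows from your prompt.

import os

import requests

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers={
        "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://shop.example.com/products",
        "limit": 50,
        "return_format": "markdown",
        "extraction_prompt": "Extract the product name, price, and availability as JSON.",
    },
)
extracted = response.json()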

Stage 4: Cleaning

Even after extraction, content often needs a cleaning pass before chunking. Common issues include:

  • Repeated header/footer text that appears on every page of a site
  • Empty links and image alt text that add noise without semantic value
  • Excessive whitespace, unicode artifacts, and encoding issues
  • Boilerplate disclaimers or copyright notices

Practical cleaning code

import re

def clean_markdown(text: str) -> str:
    """Clean extracted markdown for chunking."""
    # Remove image references without meaningful alt text
    text = re.sub(r"!\[\]\([^)]*\)", "", text)
    # Collapse multiple blank lines
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove zero-width characters
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)
    # Strip leading/trailing whitespace per line
    lines = [line.strip() for line in text.split("\n")]
    text = "\n".join(lines)
    return text.strip()

When using Spider’s markdown output, most of this cleaning is already done. The function above catches edge cases that any upstream extractor might miss.

Stage 5: Chunking

Chunking is where the pipeline’s retrieval quality is won or lost. Embedding models have fixed context windows (typically 512 to 8,192 tokens). Documents longer than that window must be split into chunks, and how you split them determines whether a query retrieves the right passage or a fragment that cuts off mid-sentence.

Three chunking strategies

Fixed-size chunking splits text every N characters or tokens with an overlap window. It is simple and predictable, but it ignores document structure. A chunk boundary can land in the middle of a paragraph, a code block, or a table row.

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # stop here, otherwise the overlap produces a redundant tail chunk
        start = end - overlap
    return chunks

Recursive character splitting (popularized by LangChain) tries a hierarchy of separators: first double newlines (paragraph breaks), then single newlines, then sentences, then words. It respects document structure better than fixed-size splitting because it prefers to break at natural boundaries.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=128,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_text(cleaned_markdown)

Semantic chunking uses an embedding model to detect topic shifts and places chunk boundaries where the semantic similarity between adjacent sentences drops. It produces the most coherent chunks but is significantly slower because it requires embedding every sentence during the chunking step.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
)

chunks = semantic_splitter.split_text(cleaned_markdown)

Why input format matters for chunking

The format of the text you feed into the chunker has a direct impact on retrieval quality. Consider the same page in three formats:

Raw HTML: <div class="post-content"><p>Vector databases store...</p><p>The indexing process...</p></div>

Recursive splitting on HTML will break at tag boundaries that have nothing to do with semantic boundaries. A <div> closing tag is not a meaningful place to end a chunk. Navigation markup, inline styles, and script tags inflate token counts without adding retrievable content.

Spider markdown: ## Vector Databases\n\nVector databases store...\n\nThe indexing process...

Markdown gives the splitter clean paragraph breaks (\n\n) and heading markers (##) to use as natural boundaries. Chunks align with the document’s actual structure. No tokens are wasted on markup.

This difference compounds at scale. Cleaner input means smaller chunks (fewer wasted tokens), better boundary placement, and higher retrieval precision. It also matters across sites: index 10,000 pages from 50 different sites with raw HTML and the template noise dominates the embedding space, so pages from the same site cluster together by layout rather than by content. Markdown-first pipelines avoid this problem entirely.
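
Because markdown preserves heading structure, you can go one step further and split on headings before applying the recursive splitter, so every chunk carries its section context. A sketch using LangChain's MarkdownHeaderTextSplitter; this is an optional refinement, not something the rest of this post depends on.

from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# First split on headings so each piece knows which section it came from
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = header_splitter.split_text(cleaned_markdown)

# Then size-limit each section with the recursive splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)

chunks = []
for section in sections:
    heading_path = " > ".join(section.metadata.values())
    for piece in splitter.split_text(section.page_content):
        # Prepend the heading path so the chunk stays self-describing after retrieval
        chunks.append(f"{heading_path}\n\n{piece}" if heading_path else piece)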

Stage 6: Embedding

Embedding converts each text chunk into a dense vector (typically 256 to 3,072 dimensions) that captures its semantic meaning. Similar chunks produce vectors that are close together in the embedding space, which is what makes vector search work.
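
"Close together" is usually measured with cosine similarity, which is also the metric configured for the indexes later in this post. A toy illustration with made-up 4-dimensional vectors:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction, near 0.0 unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunk_about_pricing = np.array([0.8, 0.1, 0.3, 0.0])
chunk_about_billing = np.array([0.7, 0.2, 0.4, 0.1])
chunk_about_kubernetes = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(chunk_about_pricing, chunk_about_billing))     # high (~0.97)
print(cosine_similarity(chunk_about_pricing, chunk_about_kubernetes))  # low (~0.12)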

Embedding model options

OpenAI text-embedding-3-small (1,536 dimensions, $0.02 per 1M tokens): the current default for most production RAG systems. Good accuracy, low cost, fast inference. Supports dimensions parameter to reduce output size (e.g., 512 dimensions for 3x storage savings with minimal accuracy loss).

from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of text chunks."""
    response = client.embeddings.create(
        input=chunks,
        model=model,
    )
    return [item.embedding for item in response.data]

Cohere embed-v3 (1,024 dimensions, $0.10 per 1M tokens): supports input_type parameter that optimizes embeddings differently for documents vs. queries. This asymmetric embedding can improve retrieval accuracy by 2-5% in benchmarks. Multilingual support is strong.

import cohere

co = cohere.Client()

def embed_chunks_cohere(chunks: list[str]) -> list[list[float]]:
    """Embed chunks using Cohere with document input type."""
    response = co.embed(
        texts=chunks,
        model="embed-english-v3.0",
        input_type="search_document",
    )
    return response.embeddings

Open source alternatives: for teams that need to avoid sending data to external APIs, sentence-transformers models run locally. BAAI/bge-large-en-v1.5 (1,024 dimensions) and nomic-ai/nomic-embed-text-v1.5 (768 dimensions) both perform competitively on MTEB benchmarks. You trade throughput and operational simplicity for data privacy and zero marginal cost.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def embed_chunks_local(chunks: list[str]) -> list[list[float]]:
    """Embed chunks using a local model."""
    return model.encode(chunks, normalize_embeddings=True).tolist()

What goes wrong

Embedding API rate limits are the most common bottleneck. OpenAI’s embedding endpoint handles up to 3,000 requests per minute for most accounts, but each request can contain up to 2,048 chunks (batched). Sending chunks one at a time instead of batching is a common mistake that slows the pipeline by 100x.

Token limits are another failure mode. If a chunk exceeds the model's maximum input length (8,191 tokens for text-embedding-3-small), the request can fail outright, or, depending on the model and client library, the text may be silently truncated so the embedding represents only the first portion of the chunk. Either way, pre-validate chunk sizes before embedding.
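
A quick pre-flight check with tiktoken catches oversized chunks before they reach the API; cl100k_base is the tokenizer used by the text-embedding-3 family. Truncating here is a last resort, and splitting the offending chunk further is usually the better fix.

import tiktoken

MAX_EMBEDDING_TOKENS = 8191
encoding = tiktoken.get_encoding("cl100k_base")

def validate_chunks(chunks: list[str]) -> list[str]:
    """Truncate (and report) any chunk that exceeds the embedding model's input limit."""
    safe_chunks = []
    for i, chunk in enumerate(chunks):
        tokens = encoding.encode(chunk)
        if len(tokens) > MAX_EMBEDDING_TOKENS:
            print(f"Chunk {i} has {len(tokens)} tokens; truncating")
            chunk = encoding.decode(tokens[:MAX_EMBEDDING_TOKENS])
        safe_chunks.append(chunk)
    return safe_chunks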

Stage 7: Indexing

Once you have vectors, you need to store them in a database that supports approximate nearest neighbor (ANN) search. The four most common options each have different operational characteristics.

Pinecone

Managed, serverless vector database. No infrastructure to run. Scales automatically. Pod-based pricing for dedicated resources, or serverless pricing based on reads, writes, and storage.

import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create an index (once)
pc.create_index(
    name="spider-docs",
    dimension=1536,
    metric="cosine",
    spec={"serverless": {"cloud": "aws", "region": "us-east-1"}},
)

index = pc.Index("spider-docs")

# Upsert vectors with metadata
vectors = [
    {
        "id": f"chunk_{i}",
        "values": embedding,
        "metadata": {
            "url": page_url,
            "text": chunk_text,
            "title": page_title,
        },
    }
    for i, (embedding, chunk_text) in enumerate(zip(embeddings, chunks))
]

# Upsert in batches of 100 (a conservative batch size that keeps each request well under Pinecone's payload limit)
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i : i + 100])

Weaviate

Open source vector database with a managed cloud option. Supports hybrid search (combining vector similarity with BM25 keyword matching). Stores objects with properties alongside vectors, so you do not need a separate metadata store.

import os

import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WEAVIATE_API_KEY")),
)

# Create collection (once)
collection = client.collections.create(
    name="SpiderDocs",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="url", data_type=DataType.TEXT),
        Property(name="title", data_type=DataType.TEXT),
    ],
)

# Insert vectors
spider_docs = client.collections.get("SpiderDocs")

with spider_docs.batch.dynamic() as batch:
    for chunk_text, embedding, page_url, page_title in zip(
        chunks, embeddings, urls, titles
    ):
        batch.add_object(
            properties={
                "text": chunk_text,
                "url": page_url,
                "title": page_title,
            },
            vector=embedding,
        )

client.close()

Chroma

Lightweight, open source, embeddable vector database. Runs in-process with no server required (SQLite backend). Good for prototyping and small to medium datasets. Also offers a hosted cloud service.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

collection = client.get_or_create_collection(
    name="spider_docs",
    metadata={"hnsw:space": "cosine"},
)

# Add vectors with documents and metadata
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings,
    documents=chunks,
    metadatas=[
        {"url": url, "title": title}
        for url, title in zip(urls, titles)
    ],
)

pgvector

PostgreSQL extension that adds vector similarity search to an existing Postgres database. If you already run Postgres, pgvector avoids adding another database to your stack. Supports IVFFlat and HNSW indexes.

import os

import psycopg2

conn = psycopg2.connect(os.getenv("DATABASE_URL"))
cur = conn.cursor()

# Enable extension and create table (once)
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        url TEXT,
        title TEXT,
        chunk_text TEXT,
        embedding vector(1536)
    );
""")

# Create HNSW index for fast similarity search
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")

# Insert vectors
for chunk_text, embedding, page_url, page_title in zip(
    chunks, embeddings, urls, titles
):
    cur.execute(
        """
        INSERT INTO documents (url, title, chunk_text, embedding)
        VALUES (%s, %s, %s, %s)
        """,
        (page_url, page_title, chunk_text, embedding),
    )

conn.commit()
cur.close()
conn.close()
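
For the query side, the pgvector Python helper package registers an adapter so vectors can be passed as parameters directly. A sketch, assuming query_embedding is a 1,536-dimension vector produced by the same embedding model used at index time:

import os

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(os.getenv("DATABASE_URL"))
register_vector(conn)  # adapt numpy arrays to the vector type
cur = conn.cursor()

# <=> is pgvector's cosine distance operator; ordering by it uses the HNSW index above
cur.execute(
    """
    SELECT url, title, chunk_text, embedding <=> %s AS distance
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    (np.array(query_embedding), np.array(query_embedding)),
)
for url, title, chunk_text, distance in cur.fetchall():
    print(f"{distance:.3f}  {url}")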

Choosing a vector database

|  | Pinecone | Weaviate | Chroma | pgvector |
| --- | --- | --- | --- | --- |
| Hosting | Managed only | Self-hosted or cloud | Embedded or cloud | Self-hosted (any Postgres) |
| Max vectors (free tier) | ~100K | 50K objects | Unlimited (local) | Unlimited (local) |
| Hybrid search | No | Yes (BM25 + vector) | No | Yes (with pg_trgm) |
| Setup complexity | Low | Medium | Very low | Low (if Postgres exists) |
| Best for | Production SaaS | Feature-rich apps | Prototyping, small apps | Existing Postgres stacks |

Stage 8: Querying

At inference time, the user’s question is embedded with the same model used to embed the chunks, then used as a query vector to retrieve the top-K most similar chunks.

from openai import OpenAI

openai_client = OpenAI()

def retrieve(query: str, collection, k: int = 5) -> list[dict]:
    """Retrieve relevant chunks from Chroma."""
    # Embed the query with the same model (and dimensions) used for the documents
    query_embedding = openai_client.embeddings.create(
        input=[query],
        model="text-embedding-3-small",
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
    )
    return [
        {"text": doc, "url": meta["url"], "score": score}
        for doc, meta, score in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

What goes wrong

Embedding model mismatch: if you embed documents with text-embedding-3-small but queries with text-embedding-ada-002, similarity scores are meaningless. Always use the same model (and same dimensions parameter) for both indexing and querying.

Missing metadata filtering: without metadata, you cannot scope queries to a specific domain, date range, or content type. Always store the source URL, crawl date, and page title alongside the vector.

Stale data: web pages change. A pipeline that indexes once and never re-crawls will serve outdated answers. Schedule periodic re-crawls and implement upsert logic that replaces stale chunks rather than duplicating them.
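
One practical pattern for that upsert logic: derive each chunk's ID deterministically from its source URL and position within the page, so re-indexing a changed page overwrites its old chunks instead of piling up duplicates. A sketch against the Chroma collection from Stage 7, where collection.upsert is Chroma's insert-or-replace call:

import hashlib
from collections import defaultdict

def chunk_id(url: str, position: int) -> str:
    """Stable ID per (page, chunk position), so a re-crawl overwrites rather than duplicates."""
    return hashlib.sha256(f"{url}#{position}".encode()).hexdigest()[:32]

ids = []
position_within_page = defaultdict(int)
for url in urls:  # one entry per chunk, aligned with chunks and embeddings
    ids.append(chunk_id(url, position_within_page[url]))
    position_within_page[url] += 1

# If a page now produces fewer chunks than before, delete its leftover IDs first
collection.upsert(
    ids=ids,
    embeddings=embeddings,
    documents=chunks,
    metadatas=[{"url": url, "title": title} for url, title in zip(urls, titles)],
)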

The full pipeline: end-to-end code

Here is the complete pipeline, from URL to queryable vector store, in a single script. It uses Spider for crawling, OpenAI for embedding, and Chroma for storage.

"""
Full pipeline: URL -> Spider crawl -> clean -> chunk -> embed -> index -> query.
Requirements: pip install requests langchain-text-splitters openai chromadb
"""

import os
import re
import requests
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
import chromadb


# --- Configuration ---
SPIDER_API_KEY = os.getenv("SPIDER_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
TARGET_URL = "https://docs.example.com"
CRAWL_LIMIT = 100
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 128
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_BATCH_SIZE = 256


# --- Stage 1-2: Crawl with Spider ---
def crawl(url: str, limit: int) -> list[dict]:
    """Crawl a site and return pages as markdown."""
    response = requests.post(
        "https://api.spider.cloud/crawl",
        headers={
            "Authorization": f"Bearer {SPIDER_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "limit": limit,
            "return_format": "markdown",
            "request": "smart",
        },
    )
    response.raise_for_status()
    return response.json()


# --- Stage 3-4: Clean ---
def clean_markdown(text: str) -> str:
    """Clean extracted markdown for chunking."""
    text = re.sub(r"!\[\]\([^)]*\)", "", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(lines).strip()


# --- Stage 5: Chunk ---
def chunk_pages(pages: list[dict]) -> tuple[list[str], list[dict]]:
    """Chunk all pages and track metadata."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    all_chunks = []
    all_metadata = []

    for page in pages:
        content = clean_markdown(page.get("content", ""))
        if not content or len(content) < 50:
            continue

        url = page.get("url", "")
        # Extract title from first markdown heading
        title_match = re.search(r"^#\s+(.+)$", content, re.MULTILINE)
        title = title_match.group(1) if title_match else url

        chunks = splitter.split_text(content)
        for chunk in chunks:
            all_chunks.append(chunk)
            all_metadata.append({"url": url, "title": title})

    return all_chunks, all_metadata


# --- Stage 6: Embed ---
def embed_batched(
    chunks: list[str], model: str, batch_size: int
) -> list[list[float]]:
    """Embed chunks in batches to stay within API limits."""
    client = OpenAI()
    all_embeddings = []

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        all_embeddings.extend([item.embedding for item in response.data])
        print(f"  Embedded {min(i + batch_size, len(chunks))}/{len(chunks)}")

    return all_embeddings


# --- Stage 7: Index ---
def index_vectors(
    chunks: list[str],
    embeddings: list[list[float]],
    metadata: list[dict],
    collection_name: str = "spider_pipeline",
) -> chromadb.Collection:
    """Index vectors in Chroma."""
    client = chromadb.PersistentClient(path="./pipeline_db")
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},
    )

    ids = [f"chunk_{i}" for i in range(len(chunks))]

    # Chroma supports batch add up to ~40K at a time
    batch = 5000
    for i in range(0, len(chunks), batch):
        end = min(i + batch, len(chunks))
        collection.add(
            ids=ids[i:end],
            embeddings=embeddings[i:end],
            documents=chunks[i:end],
            metadatas=metadata[i:end],
        )

    return collection


# --- Stage 8: Query ---
def query(collection, question: str, k: int = 5) -> list[dict]:
    """Query the vector store with the same embedding model used at index time."""
    client = OpenAI()
    question_embedding = client.embeddings.create(
        input=[question], model=EMBEDDING_MODEL
    ).data[0].embedding
    results = collection.query(query_embeddings=[question_embedding], n_results=k)
    return [
        {"text": doc, "url": meta["url"], "title": meta["title"]}
        for doc, meta in zip(
            results["documents"][0],
            results["metadatas"][0],
        )
    ]


# --- Run the pipeline ---
if __name__ == "__main__":
    print(f"Crawling {TARGET_URL} (limit: {CRAWL_LIMIT})...")
    pages = crawl(TARGET_URL, CRAWL_LIMIT)
    print(f"  Got {len(pages)} pages")

    print("Chunking...")
    chunks, metadata = chunk_pages(pages)
    print(f"  Generated {len(chunks)} chunks")

    print("Embedding...")
    embeddings = embed_batched(chunks, EMBEDDING_MODEL, EMBEDDING_BATCH_SIZE)
    print(f"  Got {len(embeddings)} embeddings")

    print("Indexing...")
    collection = index_vectors(chunks, embeddings, metadata)
    print("  Done")

    print("\nPipeline complete. Running test query...")
    results = query(collection, "How do I get started?")
    for i, r in enumerate(results):
        print(f"\n--- Result {i + 1} ---")
        print(f"URL: {r['url']}")
        print(f"Title: {r['title']}")
        print(f"Text: {r['text'][:200]}...")

Cost breakdown: 10,000 pages

Here is what the full pipeline costs for a realistic workload of 10,000 pages, assuming an average page produces 3 chunks of ~400 tokens each.

Crawling

| Provider | Estimated cost per 1K pages | 10K pages | Notes |
| --- | --- | --- | --- |
| Spider (smart mode) | ~$0.65 | ~$6.50 | Pay-as-you-go |
| Firecrawl | ~$3-5 | ~$30-50 | Varies by plan tier |
| ScrapingBee (JS mode) | ~$0.98 | ~$9.80 | 5x credit multiplier for JS |

Embedding

10,000 pages x 3 chunks/page = 30,000 chunks. At ~400 tokens per chunk, that is 12M tokens.

| Model | Cost per 1M tokens | 12M tokens |
| --- | --- | --- |
| OpenAI text-embedding-3-small | $0.02 | $0.24 |
| OpenAI text-embedding-3-large | $0.13 | $1.56 |
| Cohere embed-v3 | $0.10 | $1.20 |
| Local (bge-large) | $0.00 (compute only) | $0.00 |

Vector database hosting (monthly)

| Database | Free tier | Paid tier (100K vectors) |
| --- | --- | --- |
| Pinecone (serverless) | 100K vectors, 2M reads | ~$8/mo |
| Weaviate Cloud | 50K objects | ~$25/mo |
| Chroma (local) | Unlimited | $0 (your hardware) |
| pgvector (existing Postgres) | N/A | $0 incremental |

Total pipeline cost

For 10,000 pages with Spider + OpenAI text-embedding-3-small + Chroma (local):

| Component | Cost |
| --- | --- |
| Crawling (Spider) | $6.50 |
| Embedding (OpenAI) | $0.24 |
| Vector storage (Chroma) | $0.00 |
| Total | $6.74 |

The same pipeline with a more expensive crawler and embedding model:

| Component | Cost |
| --- | --- |
| Crawling (Firecrawl) | $40.00 |
| Embedding (Cohere) | $1.20 |
| Vector storage (Pinecone) | $8.00/mo |
| Total (first month) | $49.20 |

Crawling is the dominant cost at scale. The choice of crawler matters more than the choice of embedding model or vector database.

Production considerations

Incremental updates

A production pipeline does not re-crawl and re-embed an entire site every time. Use Spider’s crawl endpoint with a last_modified filter or compare content hashes to detect changed pages. Only re-chunk and re-embed pages whose content has actually changed.
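
A sketch of the content-hash approach, assuming you persist a small URL-to-hash map between runs (a JSON file here; a database table in practice):

import hashlib
import json
from pathlib import Path

HASH_STORE = Path("crawl_hashes.json")

def pages_to_reindex(pages: list[dict]) -> list[dict]:
    """Return only the pages whose markdown content changed since the last run."""
    previous = json.loads(HASH_STORE.read_text()) if HASH_STORE.exists() else {}
    changed, current = [], {}
    for page in pages:
        url = page.get("url", "")
        digest = hashlib.sha256(page.get("content", "").encode()).hexdigest()
        current[url] = digest
        if previous.get(url) != digest:
            changed.append(page)
    HASH_STORE.write_text(json.dumps(current, indent=2))
    return changed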

Deduplication

Large sites often have near-duplicate pages (paginated listings, print versions, locale variants). Before embedding, compute a content hash for each chunk and skip duplicates. This reduces storage costs and prevents duplicate chunks from diluting retrieval accuracy.

import hashlib

seen_hashes = set()
unique_chunks = []
unique_metadata = []

for chunk, meta in zip(chunks, metadata):
    h = hashlib.sha256(chunk.encode()).hexdigest()
    if h not in seen_hashes:
        seen_hashes.add(h)
        unique_chunks.append(chunk)
        unique_metadata.append(meta)

print(f"Deduplicated: {len(chunks)} -> {len(unique_chunks)} chunks")

Monitoring

Track these metrics in production:

  • Crawl success rate: percentage of URLs that return 200 with non-empty content.
  • Chunks per page: if this number suddenly spikes, content extraction may be including boilerplate.
  • Embedding latency: p50 and p99 per batch. Spikes indicate rate limiting.
  • Query latency: end-to-end time from question to retrieved chunks. Keep this under 200ms for interactive use.
  • Retrieval relevance: periodically sample queries and verify that the top results are actually relevant. Automated evaluation with an LLM judge can flag regressions.

Error handling

Every external call in the pipeline can fail: the crawl request, the embedding API, the vector database write. Wrap each stage in retry logic with exponential backoff. Log failures with enough context (URL, chunk index, error message) to diagnose issues without re-running the full pipeline.

Conclusion

The path from web page to queryable vector is not one step. It is eight steps, each with its own failure modes and quality trade-offs. The crawling and extraction stages determine the ceiling for everything downstream. If the content that enters your chunker is noisy, no amount of embedding model tuning or vector database optimization will compensate.

The pipeline is only as good as its weakest stage. Invest your debugging time in chunking strategy and retrieval evaluation — those are where quality is won or lost. The crawl and extraction stages should be boring infrastructure that you set up once and forget about.

A few things this post did not cover that matter in production: re-ranking with a cross-encoder after initial vector retrieval (this is the single biggest quality improvement you can make), query expansion for the cold-start problem when users phrase queries differently from the source documents, and chunk overlap tuning (the 128-character overlap used above is a starting point, not an answer). These are the real differentiators between a demo and a production RAG system.
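
For the re-ranking step specifically, a minimal sketch with an off-the-shelf cross-encoder from sentence-transformers; the model choice is an example, and retrieve() is the Stage 8 function above.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    """Re-score retrieved chunks against the question and keep the best ones."""
    scores = reranker.predict([(question, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_n]]

# Retrieve generously, then let the cross-encoder pick the final top 5
candidates = retrieve("How do I get started?", collection, k=25)
top_chunks = rerank("How do I get started?", candidates)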

Empower any project with AI-ready data

Join thousands of developers using Spider to power their data pipelines.