Build a Production RAG Pipeline with Web Data in Under 30 Minutes
Retrieval-Augmented Generation is the most reliable way to ground LLM responses in real, up-to-date information. The concept is simple: crawl a set of web pages, convert them to embeddings, store them in a vector database, and retrieve the most relevant chunks at query time. The execution, however, trips people up at step one. Getting clean, structured text out of the web is harder than it should be.
This post walks through building a complete RAG pipeline from scratch. We will crawl a website with Spider, convert the results into vector embeddings, store them in ChromaDB, and query them with an LLM. Then we will show the same pipeline implemented in four major AI frameworks: LangChain, LlamaIndex, CrewAI, and AutoGen. Spider has native, first-class integrations in all four.
By the end, you will have working Python code you can drop into a project today.
Prerequisites
Before starting, make sure you have:
- Python 3.10 or later
- A Spider API key (sign up and grab one from the dashboard)
- An OpenAI API key (for embeddings and the LLM)
Set both as environment variables:
export SPIDER_API_KEY="your-spider-api-key"
export OPENAI_API_KEY="your-openai-api-key"
Install the base dependencies:
pip install spider-client openai chromadb tiktoken
Step 1: Crawl a Website with Spider
Spider returns clean markdown by default when you set return_format to "markdown". This is the single most important detail for RAG quality, and we will cover why in a later section.
Using the Spider Python SDK
from spider import Spider
import os
client = Spider(api_key=os.getenv("SPIDER_API_KEY"))
crawl_result = client.crawl_url(
"https://docs.example.com",
params={
"return_format": "markdown",
"limit": 50,
"request": "smart",
},
)
print(f"Crawled {len(crawl_result)} pages")
for page in crawl_result:
print(f" {page['url']}: {len(page.get('content', ''))} chars")
The smart request mode inspects each page and picks the cheapest fetch strategy: plain HTTP for static pages, headless Chrome only when JavaScript rendering is actually required. The limit parameter caps total pages so you can control costs during development.
Using Raw HTTP Requests
If you prefer not to add the SDK as a dependency, the API works with any HTTP client:
import requests
import os
response = requests.post(
"https://api.spider.cloud/crawl",
headers={
"Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
"Content-Type": "application/json",
},
json={
"url": "https://docs.example.com",
"limit": 50,
"return_format": "markdown",
"request": "smart",
},
)
pages = response.json()
print(f"Crawled {len(pages)} pages")
Both approaches return the same data: a list of objects, each containing url, content (the markdown), and status. For large crawls, set the Content-Type header to application/jsonl to stream results as they arrive instead of buffering them in memory.
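A minimal streaming variant of the raw-HTTP call above, assuming the endpoint emits one JSON object per line when the jsonl content type is requested (as described above):
import json
import os
import requests
# Stream crawl results one page at a time instead of buffering the whole response
with requests.post(
    "https://api.spider.cloud/crawl",
    headers={
        "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
        "Content-Type": "application/jsonl",
    },
    json={
        "url": "https://docs.example.com",
        "limit": 50,
        "return_format": "markdown",
        "request": "smart",
    },
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        page = json.loads(line)  # one page object per line: url, content, status
        print(page["url"], len(page.get("content", "")))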
Step 2: Chunk the Markdown
Raw pages are too long to embed as single vectors. Splitting them into overlapping chunks gives the retriever more granular, relevant results at query time.
from typing import List, Dict
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
"""Split text into overlapping chunks by character count."""
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
chunk = text[start:end]
if chunk.strip():
chunks.append(chunk.strip())
start += chunk_size - overlap
return chunks
def chunk_crawl_results(pages: List[Dict]) -> List[Dict]:
"""Chunk all crawled pages, preserving source URL metadata."""
all_chunks = []
for page in pages:
content = page.get("content", "")
url = page.get("url", "")
if not content:
continue
chunks = chunk_text(content)
for i, chunk in enumerate(chunks):
all_chunks.append({
"text": chunk,
"url": url,
"chunk_index": i,
})
return all_chunks
chunks = chunk_crawl_results(crawl_result)
print(f"Created {len(chunks)} chunks from {len(crawl_result)} pages")
A chunk size of 1,000 characters with 200-character overlap works well for most documentation and article content. For highly structured content (API references, changelogs), you may want smaller chunks (500 characters) so each chunk maps to a single concept.
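If you want to try tighter chunks on a reference-heavy page, the helper above takes the size and overlap as parameters. The 500/100 values below are the suggestion from this section, not a tested optimum:
# Tighter chunks for reference-style content, reusing chunk_text and crawl_result from above
api_page = crawl_result[0]
reference_chunks = chunk_text(api_page.get("content", ""), chunk_size=500, overlap=100)
print(f"{len(reference_chunks)} chunks at 500 characters")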
Step 3: Embed and Store in a Vector Database
We will use OpenAI’s text-embedding-3-small model for embeddings and ChromaDB as the vector store. ChromaDB runs in-process with no external server required, which makes it ideal for prototyping and smaller production workloads.
import chromadb
from openai import OpenAI
openai_client = OpenAI()
# Create (or connect to) a persistent ChromaDB collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(
name="web_docs",
metadata={"hnsw:space": "cosine"},
)
def get_embeddings(texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
"""Embed a batch of texts using OpenAI's embedding API."""
response = openai_client.embeddings.create(input=texts, model=model)
return [item.embedding for item in response.data]
# Embed and insert in batches of 100
BATCH_SIZE = 100
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i:i + BATCH_SIZE]
texts = [c["text"] for c in batch]
embeddings = get_embeddings(texts)
collection.add(
ids=[f"chunk_{i + j}" for j in range(len(batch))],
embeddings=embeddings,
documents=texts,
metadatas=[{"url": c["url"], "chunk_index": c["chunk_index"]} for c in batch],
)
print(f"Stored {collection.count()} chunks in ChromaDB")
Using Pinecone Instead
If you need a managed vector database that scales to millions of vectors without operational overhead, swap ChromaDB for Pinecone:
pip install pinecone
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index_name = "web-docs"
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index(index_name)
# Upsert in batches
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i:i + BATCH_SIZE]
texts = [c["text"] for c in batch]
embeddings = get_embeddings(texts)
vectors = [
{
"id": f"chunk_{i + j}",
"values": emb,
"metadata": {"text": c["text"], "url": c["url"]},
}
for j, (c, emb) in enumerate(zip(batch, embeddings))
]
index.upsert(vectors=vectors)
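Retrieval against Pinecone differs slightly from the ChromaDB query shown in Step 4. A minimal sketch, reusing get_embeddings and the index created above (the question is just an example):
# Query Pinecone for the top-5 chunks; the chunk text lives in each match's metadata
question = "How do I authenticate API requests?"
q_emb = get_embeddings([question])[0]
results = index.query(vector=q_emb, top_k=5, include_metadata=True)
for match in results.matches:
    print(match.score, match.metadata["url"])
    # match.metadata["text"] holds the chunk text to pass to the LLM as context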
Step 4: Query the Pipeline
With embeddings stored, you can retrieve relevant chunks and pass them to an LLM as context:
def query_rag(question: str, top_k: int = 5) -> str:
"""Retrieve relevant chunks and generate an answer."""
# Embed the question
q_embedding = get_embeddings([question])[0]
# Retrieve top-k chunks
results = collection.query(
query_embeddings=[q_embedding],
n_results=top_k,
)
# Build context from retrieved documents
context_parts = []
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
context_parts.append(f"[Source: {meta['url']}]\n{doc}")
context = "\n\n---\n\n".join(context_parts)
# Generate answer with GPT-4o
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": (
"You are a helpful assistant. Answer the user's question "
"based only on the provided context. If the context does not "
"contain enough information, say so. Cite your sources."
),
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}",
},
],
temperature=0.2,
)
return response.choices[0].message.content
answer = query_rag("How do I authenticate API requests?")
print(answer)
That is the complete pipeline: crawl, chunk, embed, store, query. Five steps, roughly 100 lines of code, and it runs end to end in under 30 minutes including crawl time for a 50-page site.
Why Markdown Beats Raw HTML for RAG
This is the part most tutorials skip, and it is the difference between a pipeline that retrieves useful context and one that returns noise.
A typical HTML page is 90% boilerplate: navigation menus, footers, script tags, inline styles, tracking pixels, cookie banners, ad containers. When you embed raw HTML, all of that boilerplate becomes part of the vector representation. The result is that semantically unrelated pages end up with similar embeddings because they share the same template chrome, and your retriever returns irrelevant results.
Spider’s markdown conversion strips all of that away before you ever see the content. What you get is the actual text of the page: headings, paragraphs, lists, code blocks, and tables. Nothing else.
The practical impact:
- Smaller chunks, less noise. A markdown chunk is mostly content. An HTML chunk of the same length carries tag attributes, class names, and script fragments that add bytes without adding meaning.
- Better embedding quality. Embeddings trained on natural language perform best on natural language, not on <div class="nav-wrapper">.
- Lower embedding costs. Fewer tokens per page means fewer tokens billed by your embedding model. Markdown consistently produces fewer tokens than the equivalent HTML, with the gap depending on how boilerplate-heavy the source pages are; the quick check after this list shows how to measure it on your own pages.
- More accurate retrieval. Without template noise pulling embeddings toward each other, cosine similarity actually reflects semantic similarity.
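Here is a quick way to measure that token gap for your own pages. It reuses crawl_result from Step 1 and fetches the raw HTML of the same URL directly for comparison; the exact numbers will vary by site:
import requests
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by OpenAI's embedding models
# Markdown content for the first crawled page, straight from Spider
page = crawl_result[0]
markdown_tokens = len(enc.encode(page.get("content", "")))
# Raw HTML for the same URL, fetched directly
html = requests.get(page["url"]).text
html_tokens = len(enc.encode(html))
print(f"markdown: {markdown_tokens} tokens, raw HTML: {html_tokens} tokens")
print(f"markdown is {markdown_tokens / max(html_tokens, 1):.0%} of the HTML token count")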
If you are building RAG on web data and you are not converting to markdown first, you are leaving retrieval quality on the table.
Framework Integrations
Spider ships as a native document loader in LangChain, LlamaIndex, CrewAI, and Microsoft AutoGen. Below is the same RAG pipeline built in each framework, using Spider as the data source.
LangChain
Install dependencies:
pip install langchain langchain-openai langchain-chroma langchain-community spider-client
import os
from langchain_community.document_loaders import SpiderLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
# 1. Crawl with Spider's LangChain loader
loader = SpiderLoader(
url="https://docs.example.com",
api_key=os.getenv("SPIDER_API_KEY"),
mode="crawl",
params={
"return_format": "markdown",
"limit": 50,
"request": "smart",
},
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
chunks,
embeddings,
persist_directory="./chroma_langchain",
)
# 4. Query
llm = ChatOpenAI(model="gpt-4o", temperature=0.2)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
return_source_documents=True,
)
result = qa_chain.invoke({"query": "How do I authenticate API requests?"})
print(result["result"])
for doc in result["source_documents"]:
print(f" Source: {doc.metadata.get('url', 'unknown')}")
LangChain’s SpiderLoader handles the API call, pagination, and document formatting internally. Each returned Document object has page_content set to the markdown and metadata containing the source URL.
LlamaIndex
Install dependencies:
pip install llama-index llama-index-readers-web spider-client
import os
from llama_index.readers.web import SpiderWebReader
from llama_index.core import VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# 1. Crawl with Spider's LlamaIndex reader
reader = SpiderWebReader(
api_key=os.getenv("SPIDER_API_KEY"),
mode="crawl",
params={
"return_format": "markdown",
"limit": 50,
"request": "smart",
},
)
documents = reader.load_data(url="https://docs.example.com")
print(f"Loaded {len(documents)} documents")
# 2. Configure embedding and LLM
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = LlamaOpenAI(model="gpt-4o", temperature=0.2)
# 3. Build index (automatically chunks, embeds, and stores)
index = VectorStoreIndex.from_documents(documents)
# 4. Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("How do I authenticate API requests?")
print(response)
for node in response.source_nodes:
print(f" Source: {node.metadata.get('url', 'unknown')} (score: {node.score:.3f})")
LlamaIndex handles chunking, embedding, and indexing in a single from_documents call. The SpiderWebReader plugs directly into the standard reader interface.
CrewAI
Install dependencies:
pip install 'crewai[tools]' langchain langchain-community langchain-openai langchain-chroma spider-client
import os
from crewai import Agent, Task, Crew
from crewai_tools import tool
from langchain_community.document_loaders import SpiderLoader
@tool("crawl_and_search")
def crawl_and_search(url: str, question: str) -> str:
"""Crawl a website with Spider and search the content for answers to a question."""
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Crawl
loader = SpiderLoader(
url=url,
api_key=os.getenv("SPIDER_API_KEY"),
mode="crawl",
params={"return_format": "markdown", "limit": 50, "request": "smart"},
)
docs = loader.load()
# Chunk and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
# Retrieve
results = vectorstore.similarity_search(question, k=5)
return "\n\n---\n\n".join([r.page_content for r in results])
researcher = Agent(
role="Research Analyst",
goal="Find accurate answers to technical questions using web documentation.",
backstory="You are a senior engineer who excels at finding precise information in documentation.",
tools=[crawl_and_search],
verbose=True,
)
task = Task(
description=(
"Crawl https://docs.example.com and answer this question: "
"How do I authenticate API requests? "
"Provide a detailed answer with code examples if available."
),
expected_output="A clear, detailed answer with relevant code snippets and source URLs.",
agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[task], verbose=True)
result = crew.kickoff()
print(result)
CrewAI uses Spider through the @tool decorator pattern. The agent decides when to call the tool and how to interpret the results. This is useful when you want the LLM to orchestrate multiple research steps autonomously.
AutoGen
Install dependencies:
pip install pyautogen spider-client chromadb
import os
import chromadb
from spider import Spider
from openai import OpenAI
from autogen import ConversableAgent, register_function
config_list = [{"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")}]
llm_config = {"config_list": config_list, "temperature": 0.2}
spider_client = Spider(api_key=os.getenv("SPIDER_API_KEY"))
openai_client = OpenAI()
# Persistent vector store
chroma_client = chromadb.PersistentClient(path="./chroma_autogen")
collection = chroma_client.get_or_create_collection(
name="autogen_docs",
metadata={"hnsw:space": "cosine"},
)
def crawl_and_index(url: str, limit: int = 50) -> str:
"""Crawl a website and index the content for RAG retrieval."""
pages = spider_client.crawl_url(url, params={
"return_format": "markdown",
"limit": limit,
"request": "smart",
})
chunk_id = 0
for page in pages:
content = page.get("content", "")
page_url = page.get("url", "")
if not content:
continue
# Simple chunking
for i in range(0, len(content), 800):
chunk = content[i:i + 1000].strip()
if not chunk:
continue
emb = openai_client.embeddings.create(
input=[chunk], model="text-embedding-3-small"
).data[0].embedding
collection.add(
ids=[f"autogen_chunk_{chunk_id}"],
embeddings=[emb],
documents=[chunk],
metadatas=[{"url": page_url}],
)
chunk_id += 1
return f"Indexed {chunk_id} chunks from {len(pages)} pages at {url}"
def search_indexed_docs(question: str, top_k: int = 5) -> str:
"""Search the indexed documents for relevant content."""
q_emb = openai_client.embeddings.create(
input=[question], model="text-embedding-3-small"
).data[0].embedding
results = collection.query(query_embeddings=[q_emb], n_results=top_k)
parts = []
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
parts.append(f"[Source: {meta['url']}]\n{doc}")
return "\n\n---\n\n".join(parts)
# Define agents
assistant = ConversableAgent(
"RAGAssistant",
llm_config=llm_config,
system_message=(
"You are a research assistant. Use the crawl_and_index tool to crawl "
"websites, then use search_indexed_docs to find answers. "
"Always cite your sources. Return 'TERMINATE' when done."
),
)
user_proxy = ConversableAgent(
"UserProxy",
llm_config=False,
human_input_mode="NEVER",
code_execution_config=False,
is_termination_msg=lambda x: x.get("content", "") is not None
and "terminate" in x.get("content", "").lower(),
default_auto_reply="Continue if not finished, otherwise return 'TERMINATE'.",
)
# Register tools with both agents
register_function(
crawl_and_index,
caller=assistant,
executor=user_proxy,
description="Crawl a website and index its content for search.",
)
register_function(
search_indexed_docs,
caller=assistant,
executor=user_proxy,
description="Search previously indexed documents for relevant content.",
)
# Run the conversation
user_proxy.initiate_chat(
assistant,
message=(
"Crawl https://docs.example.com and then answer: "
"How do I authenticate API requests?"
),
)
AutoGen’s function-calling pattern lets the assistant agent decide which tools to invoke and in what order. The crawl_and_index function handles the Spider crawl and embedding, while search_indexed_docs performs retrieval.
Estimated Costs
One of the most common questions when evaluating a RAG pipeline is “what will this cost in production?” Here is a breakdown for a realistic workload: crawling 500 pages of documentation and serving 1,000 queries per month.
Data ingestion (one-time per crawl)
| Component | Calculation | Cost |
|---|---|---|
| Spider crawl (500 pages, smart mode) | 500 x $0.00065 avg | ~$0.33 |
| OpenAI embeddings (est. 2,500 chunks, ~1.5M tokens) | $0.02 per 1M tokens | ~$0.03 |
| ChromaDB (self-hosted) | Open source, runs locally | $0.00 |
| Ingestion total | | ~$0.36 |
Query serving (monthly, 1,000 queries)
| Component | Calculation | Cost |
|---|---|---|
| Embedding queries (1,000 queries, ~50K tokens) | $0.02 per 1M tokens | ~$0.001 |
| GPT-4o completions (1,000 queries, ~500K tokens) | $2.50 per 1M input, $10 per 1M output | ~$3.00 |
| ChromaDB (self-hosted) | Open source | $0.00 |
| Monthly query total | | ~$3.00 |
If you use Pinecone instead of ChromaDB, the serverless tier starts free for up to 2GB of storage. Beyond that, costs scale based on read/write units and storage.
The LLM completion cost dominates in production. The crawling and embedding costs are negligible by comparison. Refreshing your data weekly (re-crawling 500 pages) adds roughly $1.44/month to the Spider bill.
Comparison with other scrapers
| Scraper | 500-page crawl cost | Notes |
|---|---|---|
| Spider | ~$0.33 | Pay-as-you-go, no subscription |
| Apify | ~$2.50+ | Depends on Actor and compute units |
| ScrapingBee | ~$4.90+ | JS rendering at 5 credits per request |
| Firecrawl | ~$1.00+ | Free tier limited, then subscription |
Spider’s per-page cost is among the lowest of managed scraping APIs, with no monthly minimum or subscription fee. Competitor pricing varies by plan and usage tier; check their current pricing pages for exact numbers.
Production Considerations
The pipeline above works for development and small production deployments. When scaling up, keep these points in mind.
Re-crawling and freshness
Web content changes. Set up a scheduled re-crawl (daily, weekly, or on-demand) and re-embed updated pages. Spider’s cache parameter lets you control whether results come from cache or a fresh fetch:
crawl_result = client.crawl_url(
"https://docs.example.com",
params={
"return_format": "markdown",
"limit": 500,
"request": "smart",
"cache": False, # Force fresh crawl
},
)
Deduplication
Many sites serve the same content at multiple URLs (www vs non-www, trailing slashes, query parameters). Hash each chunk’s content before inserting and skip duplicates. This prevents the retriever from wasting slots on identical results.
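A minimal sketch of content-hash deduplication, applied to the chunk list from Step 2 before the embedding loop:
import hashlib
def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Drop chunks whose text has already been seen (exact-match dedup)."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
chunks = dedupe_chunks(chunks)
print(f"{len(chunks)} chunks after deduplication")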
Chunking strategy
The recursive character splitter works as a general-purpose default. For better results on structured documentation, consider splitting on markdown headings (## boundaries) so each chunk corresponds to a logical section rather than an arbitrary character boundary.
from langchain.text_splitter import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
)
md_chunks = splitter.split_text(page_content)  # page_content: the markdown string of one crawled page
Hybrid search
Cosine similarity on dense embeddings works well for semantic queries (“how does authentication work?”) but poorly for keyword-exact queries (“error code 403”). Adding a BM25 sparse index alongside your vector index and combining scores gives you the best of both worlds. LangChain’s EnsembleRetriever supports this pattern out of the box.
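A sketch of that pattern, combining a BM25 retriever with the Chroma retriever from the LangChain example (BM25Retriever needs the rank_bm25 package; the 0.4/0.6 weights are only a starting point, and chunks and vectorstore are the ones built in that section):
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Sparse keyword retriever over the same chunks used to build the vector store
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever backed by the Chroma vector store
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Blend both result lists; tune the weights against your own queries
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],
)
docs = hybrid_retriever.invoke("error code 403")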
Metadata filtering
Store structured metadata (URL path, page title, last-modified date) alongside each chunk. At query time, filter on metadata before running similarity search. This is especially useful when your corpus covers multiple products or documentation versions.
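With the ChromaDB collection from Step 3, for example, a metadata filter is passed through the where argument; the URL value here is hypothetical:
# Restrict similarity search to chunks from a single page
q_emb = get_embeddings(["How do I authenticate API requests?"])[0]
results = collection.query(
    query_embeddings=[q_emb],
    n_results=5,
    where={"url": "https://docs.example.com/authentication"},
)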
Wrapping Up
The hardest part of building a RAG pipeline is not the vector math or the LLM prompting. It is getting clean, structured input data. Garbage in, garbage out applies more literally here than anywhere else in software.
Things this tutorial did not cover that matter in production: re-ranking (add a cross-encoder after vector retrieval for significant quality gains), evaluation (build a test set of question-answer pairs and measure retrieval precision after every change), and freshness (schedule re-crawls and implement incremental updates rather than full re-ingestion).
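As a reference point for the re-ranking step, a cross-encoder pass can be as small as this sketch. It assumes the sentence-transformers package and a commonly used public re-ranking model, neither of which the pipeline above requires:
from sentence_transformers import CrossEncoder  # pip install sentence-transformers
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(question: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score each (question, chunk) pair and keep the highest-scoring chunks."""
    scores = reranker.predict([(question, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:keep]]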
The complete pipeline runs under 100 lines of Python. The code above works as-is — swap in your own URLs and you will have a working RAG system in minutes. The real work starts when you need to make it good.