Integration
LangChain + Spider
Use Spider as a document loader in your LangChain applications. Crawl websites, search the web, and feed clean markdown directly into your RAG chains, agents, and retrieval pipelines.
```python
from langchain_community.document_loaders import SpiderLoader

# Crawl a website and load as LangChain documents
loader = SpiderLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        "return_format": "markdown",
        "limit": 50,
    },
)
docs = loader.load()
# Each doc has .page_content (markdown) and .metadata (url, title, etc.)

# Feed into a RAG chain
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
```

Document Loader
SpiderLoader returns LangChain Document objects with page_content and metadata. Drop it into any existing chain.
Crawl, Scrape, or Search
Set mode to "crawl" for full-site indexing, "scrape" for specific pages, or use the search endpoint for web-wide discovery.
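A small sketch of how those two loader modes map onto SpiderLoader arguments. The `loader_kwargs` helper is hypothetical (not part of the loader API), shown only to make the crawl/scrape distinction concrete:

```python
def loader_kwargs(target: str, full_site: bool = False) -> dict:
    """Build keyword arguments for SpiderLoader: crawl a whole site or scrape one page."""
    params = {"return_format": "markdown"}
    if full_site:
        params["limit"] = 50  # cap the crawl so indexing stays bounded
    return {
        "url": target,
        "mode": "crawl" if full_site else "scrape",
        "params": params,
    }

# Usage (requires langchain_community and a Spider API key):
# loader = SpiderLoader(**loader_kwargs("https://docs.example.com", full_site=True))
```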
RAG-Ready Markdown
Clean markdown output that embedding models handle well. Navigation, ads, and boilerplate are stripped automatically.
Streaming Support
Use lazy_load() to stream documents as they are crawled. Start embedding while Spider is still fetching.
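One way to embed while crawling is to batch the iterator that `lazy_load()` returns. The `batches` helper below is a generic sketch (not part of LangChain); the commented lines show where the real loader and vector store calls would go:

```python
from itertools import islice

def batches(doc_iter, size=25):
    """Yield fixed-size lists from any document iterator, e.g. loader.lazy_load()."""
    it = iter(doc_iter)
    while chunk := list(islice(it, size)):
        yield chunk

# With a real loader (network call, API key required):
# for chunk in batches(loader.lazy_load(), size=25):
#     vectorstore.add_documents(chunk)
```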
All Crawl Parameters
Pass any Spider parameter through the loader: proxy mode, browser rendering, readability, custom selectors, and more.
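For example, a `params` dict might enable browser rendering, proxying, and readability extraction. The commented parameter names below are assumptions based on Spider's API; verify them against the current parameter reference before use:

```python
params = {
    "return_format": "markdown",
    "limit": 100,
    "request": "chrome",    # assumed name: render pages in a headless browser
    "proxy_enabled": True,  # assumed name: route requests through Spider's proxies
    "readability": True,    # assumed name: apply readability pre-processing
}

# loader = SpiderLoader(url="https://docs.example.com", mode="crawl", params=params)
```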
Source Attribution
Every document includes its URL, crawl timestamp, and page metadata for citation grounding in your RAG responses.
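A minimal sketch of citation grounding from that metadata. Plain dicts stand in for the `.metadata` of loaded documents here, and the exact keys available depend on your crawl parameters:

```python
def cite(metadata: dict) -> str:
    """Format one document's metadata as a markdown citation link."""
    title = metadata.get("title") or metadata.get("url", "unknown source")
    return f"[{title}]({metadata.get('url', '')})"

# Stand-ins for doc.metadata from loaded documents
sources = [
    {"url": "https://docs.example.com/intro", "title": "Introduction"},
    {"url": "https://docs.example.com/api"},  # no title scraped
]
citations = "\n".join(cite(m) for m in sources)
```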
Search + LangChain for Live RAG
Combine Spider's Search API with LangChain to answer questions using real-time web data.
```python
from spider import Spider
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

spider = Spider()
llm = ChatOpenAI(model="gpt-4o")

# Search the web and get content in one call
results = spider.search(
    "latest changes to GDPR enforcement",
    params={
        "search_limit": 5,
        "fetch_page_content": True,
        "return_format": "markdown",
    },
)

context = "\n---\n".join(
    f"[{r['url']}]\n{r['content'][:3000]}" for r in results if r.get("content")
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using the sources. Cite URLs."),
    ("user", "Sources:\n{context}\n\nQuestion: {question}"),
])
chain = prompt | llm
answer = chain.invoke({"context": context, "question": "What changed?"})
print(answer.content)
```