Spider as a LangChain document loader.
Crawl websites, search the web, and feed clean markdown into your RAG chains, agents, and retrieval pipelines. Native Python integration via langchain-community.
pip install spider-client langchain-community
# Quick import
from langchain_community.document_loaders import SpiderLoader

Crawl a site, get LangChain Documents.
SpiderLoader returns Document objects with page_content (markdown) and metadata (url, title, crawl timestamp). Plug them directly into any vector store.
from langchain_community.document_loaders import SpiderLoader
# Crawl a website and load as LangChain documents
loader = SpiderLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        "return_format": "markdown",
        "limit": 50,
    },
)
docs = loader.load()
# Each doc has .page_content (markdown) and .metadata (url, title, etc.)
# Feed into a RAG chain
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

Built for retrieval pipelines.
Document loader
SpiderLoader returns LangChain Document objects with page_content and metadata. Drop it into any existing chain.
Crawl, scrape, or search
Set mode to "crawl" for full-site indexing or "scrape" for specific pages. For web-wide discovery, call Spider's search endpoint through the spider-client package.
RAG-ready markdown
Clean markdown output that embedding models handle well. Navigation, ads, and boilerplate are stripped automatically.
Lazy load
Use lazy_load() to stream documents as they are crawled. Start embedding while Spider is still fetching.
All crawl parameters
Pass any Spider parameter through the loader: proxy mode, browser rendering, readability, custom selectors, and more.
Source attribution
Every document includes its URL, crawl timestamp, and page metadata for citation grounding in RAG responses.
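The lazy-loading pattern described above can be sketched roughly as follows. This is a minimal illustration, not part of the Spider or LangChain APIs: `batched` and `index_lazily` are hypothetical helper names, and the sketch assumes only that the loader exposes `lazy_load()` and that the vector store exposes `add_documents()`, as LangChain loaders and vector stores generally do.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the whole iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


def index_lazily(loader, vectorstore, batch_size: int = 16) -> int:
    """Hypothetical helper: embed documents in batches as lazy_load() yields them.

    Assumes `loader.lazy_load()` returns an iterator of documents and
    `vectorstore.add_documents(list_of_documents)` embeds and stores them.
    Returns the number of documents indexed.
    """
    total = 0
    for batch in batched(loader.lazy_load(), batch_size):
        vectorstore.add_documents(batch)
        total += len(batch)
    return total
```

With a SpiderLoader and an already-created vector store, a call might look like `index_lazily(loader, vectorstore, batch_size=16)`. Batching keeps each embedding request small while indexing starts before the full crawl has been collected.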
Answer questions with real-time sources.
Combine Spider's Search API with LangChain to ground answers in live web data. One call returns search results plus page content as markdown.
from spider import Spider
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
spider = Spider()
llm = ChatOpenAI(model="gpt-4o")
# Search the web and get content in one call
results = spider.search(
    "latest changes to GDPR enforcement",
    params={
        "search_limit": 5,
        "fetch_page_content": True,
        "return_format": "markdown",
    },
)
context = "\n---\n".join(
    [f"[{r['url']}]\n{r['content'][:3000]}" for r in results if r.get("content")]
)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using the sources. Cite URLs."),
    ("user", "Sources:\n{context}\n\nQuestion: {question}"),
])
chain = prompt | llm
answer = chain.invoke({"context": context, "question": "What changed?"})
print(answer.content)

Build a LangChain pipeline on live web data.
Free balance on sign-up. No subscription required.