Integration
LangChain + Spider
Use Spider as a document loader in your LangChain applications. Crawl websites, search the web, and feed clean markdown directly into your RAG chains, agents, and retrieval pipelines.
```python
from langchain_community.document_loaders import SpiderLoader

# Crawl a website and load as LangChain documents
loader = SpiderLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        "return_format": "markdown",
        "limit": 50,
    },
)
docs = loader.load()
# Each doc has .page_content (markdown) and .metadata (url, title, etc.)

# Feed into a RAG chain
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
```

Document Loader
SpiderLoader returns LangChain Document objects with page_content and metadata. Drop it into any existing chain.
Crawl, Scrape, or Search
Set mode to "crawl" for full-site indexing, "scrape" for specific pages, or use the search endpoint for web-wide discovery.
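A small sketch of how those two loader modes map onto SpiderLoader arguments. The `loader_kwargs` helper is hypothetical (not part of the loader API), shown only to make the crawl/scrape distinction concrete:

```python
def loader_kwargs(target: str, full_site: bool = False) -> dict:
    """Build keyword arguments for SpiderLoader: crawl a whole site or scrape one page."""
    params = {"return_format": "markdown"}
    if full_site:
        params["limit"] = 50  # cap the crawl so indexing stays bounded
    return {
        "url": target,
        "mode": "crawl" if full_site else "scrape",
        "params": params,
    }

# Usage (requires langchain_community and a Spider API key):
# loader = SpiderLoader(**loader_kwargs("https://docs.example.com", full_site=True))
```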
RAG-Ready Markdown
Clean markdown output that embedding models handle well. Navigation, ads, and boilerplate are stripped automatically.
Streaming Support
Use lazy_load() to stream documents as they are crawled. Start embedding while Spider is still fetching.
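One way to embed while crawling is to batch the iterator that `lazy_load()` returns. The `batches` helper below is a generic sketch (not part of LangChain); the commented lines show where the real loader and vector store calls would go:

```python
from itertools import islice

def batches(doc_iter, size=25):
    """Yield fixed-size lists from any document iterator, e.g. loader.lazy_load()."""
    it = iter(doc_iter)
    while chunk := list(islice(it, size)):
        yield chunk

# With a real loader (network call, API key required):
# for chunk in batches(loader.lazy_load(), size=25):
#     vectorstore.add_documents(chunk)
```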
All Crawl Parameters
Pass any Spider parameter through the loader: proxy mode, browser rendering, readability, custom selectors, and more.
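For example, a `params` dict might enable browser rendering, proxying, and readability extraction. The commented parameter names below are assumptions based on Spider's API; verify them against the current parameter reference before use:

```python
params = {
    "return_format": "markdown",
    "limit": 100,
    "request": "chrome",    # assumed name: render pages in a headless browser
    "proxy_enabled": True,  # assumed name: route requests through Spider's proxies
    "readability": True,    # assumed name: apply readability pre-processing
}

# loader = SpiderLoader(url="https://docs.example.com", mode="crawl", params=params)
```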
Source Attribution
Every document includes its URL, crawl timestamp, and page metadata for citation grounding in your RAG responses.
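A minimal sketch of citation grounding from that metadata. Plain dicts stand in for the `.metadata` of loaded documents here, and the exact keys available depend on your crawl parameters:

```python
def cite(metadata: dict) -> str:
    """Format one document's metadata as a markdown citation link."""
    title = metadata.get("title") or metadata.get("url", "unknown source")
    return f"[{title}]({metadata.get('url', '')})"

# Stand-ins for doc.metadata from loaded documents
sources = [
    {"url": "https://docs.example.com/intro", "title": "Introduction"},
    {"url": "https://docs.example.com/api"},  # no title scraped
]
citations = "\n".join(cite(m) for m in sources)
```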
Search + LangChain for Live RAG
Combine Spider's Search API with LangChain to answer questions using real-time web data.
```python
from spider import Spider
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

spider = Spider()
llm = ChatOpenAI(model="gpt-4o")

# Search the web and get content in one call
results = spider.search(
    "latest changes to GDPR enforcement",
    params={
        "search_limit": 5,
        "fetch_page_content": True,
        "return_format": "markdown",
    },
)

context = "\n---\n".join(
    f"[{r['url']}]\n{r['content'][:3000]}" for r in results if r.get("content")
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using the sources. Cite URLs."),
    ("user", "Sources:\n{context}\n\nQuestion: {question}"),
])
chain = prompt | llm
answer = chain.invoke({"context": context, "question": "What changed?"})
print(answer.content)
```