LangChain integration

Spider as a LangChain document loader.

Crawl websites, search the web, and feed clean markdown into your RAG chains, agents, and retrieval pipelines. Native Python integration via langchain-community.

Python langchain-community SpiderLoader RAG
Install pip
pip install spider-client langchain-community

# Quick import
from langchain_community.document_loaders import SpiderLoader
01 · Document loader

Crawl a site, get LangChain Documents.

SpiderLoader returns Document objects with page_content (markdown) and metadata (url, title, crawl timestamp). Plug them directly into any vector store.

SpiderLoader Python
from langchain_community.document_loaders import SpiderLoader

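# Crawl a website and load the pages as LangChain documents.
# The API key is read from the SPIDER_API_KEY environment variable
# unless api_key=... is passed to the loader.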
loader = SpiderLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        "return_format": "markdown",
        "limit": 50,
    }
)

docs = loader.load()
# Each doc has .page_content (markdown) and .metadata (url, title, etc.)

# Feed into a RAG chain
# (also requires: pip install langchain-chroma langchain-openai, with OPENAI_API_KEY set)
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
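
Before handing crawled documents to a vector store, it can help to drop blank pages and repeated URLs so they are not embedded. The following is a minimal sketch, not part of SpiderLoader: `prepare_for_indexing` is a hypothetical helper, and it assumes each document exposes `page_content` and a `metadata` dict with a `url` key, as described above. Stand-in objects are used so the snippet runs without any API key.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def prepare_for_indexing(docs: Iterable[Any]) -> List[Any]:
    """Drop blank pages and keep only the first document seen for each URL."""
    seen_urls = set()
    kept = []
    for doc in docs:
        text = (getattr(doc, "page_content", "") or "").strip()
        if not text:
            continue  # nothing to embed
        # The "url" metadata key is an assumption based on the description above.
        url = (getattr(doc, "metadata", None) or {}).get("url")
        if url is not None:
            if url in seen_urls:
                continue  # already kept a document for this URL
            seen_urls.add(url)
        kept.append(doc)
    return kept


# Stand-ins with the same two attributes as a LangChain Document.
sample = [
    SimpleNamespace(page_content="# Intro", metadata={"url": "https://docs.example.com/"}),
    SimpleNamespace(page_content="   ", metadata={"url": "https://docs.example.com/empty"}),
    SimpleNamespace(page_content="# Intro again", metadata={"url": "https://docs.example.com/"}),
]
cleaned = prepare_for_indexing(sample)
```

In a real pipeline the result of `loader.load()` would take the place of `sample`, and `cleaned` would be passed to `Chroma.from_documents`.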
02 · What you get

Built for retrieval pipelines.

Loader

Document loader

SpiderLoader returns LangChain Document objects with page_content and metadata. Drop it into any existing chain.

Modes

Crawl, scrape, or search

Set mode to "crawl" for full-site indexing or "scrape" for a single page. For web-wide discovery, call the search endpoint through the Spider client rather than through the loader.
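
One way to keep the mode choice explicit is to build the loader arguments in a small function that rejects anything other than the two loader modes. This is a sketch under assumptions: `loader_kwargs` is a hypothetical helper, and it only reuses the `url`, `mode`, and `params` arguments and the `return_format` and `limit` keys already shown in the loader example.

```python
from typing import Any, Dict, Optional

# Modes handled by the loader; search is a separate call on the Spider client.
LOADER_MODES = ("crawl", "scrape")


def loader_kwargs(url: str, mode: str = "scrape", limit: Optional[int] = None) -> Dict[str, Any]:
    """Return keyword arguments for SpiderLoader, validating the mode first."""
    if mode not in LOADER_MODES:
        raise ValueError(f"mode must be one of {LOADER_MODES}, got {mode!r}")
    params: Dict[str, Any] = {"return_format": "markdown"}
    # A page limit only makes sense when following links across a site.
    if mode == "crawl" and limit is not None:
        params["limit"] = limit
    return {"url": url, "mode": mode, "params": params}


crawl_args = loader_kwargs("https://docs.example.com", "crawl", limit=50)
scrape_args = loader_kwargs("https://docs.example.com/pricing", "scrape", limit=50)

# Usage, once the packages and API key are in place:
#   loader = SpiderLoader(**crawl_args)
```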

Output

RAG-ready markdown

Output is clean markdown that embeds well. Navigation, ads, and boilerplate are stripped automatically.

Streaming

Lazy load

Use lazy_load() to stream documents as they are crawled. Start embedding while Spider is still fetching.
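
To embed while documents are still arriving, the iterator from `lazy_load()` can be consumed in small batches. Below is a minimal sketch of that pattern; `batched` is a hypothetical helper written with the standard library only, and the batch size of 16 in the usage comment is an arbitrary choice.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without reading the whole iterable first."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage with a loader and vector store that already exist:
#   for batch in batched(loader.lazy_load(), 16):
#       vectorstore.add_documents(batch)
demo = list(batched(range(5), 2))
```

Because each batch is pulled from the iterator on demand, only one batch of documents is held by this helper at a time.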

Params

All crawl parameters

Pass any Spider parameter through the loader: proxy mode, browser rendering, readability, custom selectors, and more.
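
Since parameters travel as a plain dict, shared defaults and per-crawl overrides can be merged before the loader is created. The sketch below assumes the `return_format` and `limit` keys from the loader example; the `readability` and `request` keys are illustrative guesses at Spider parameter names and should be checked against the Spider API reference before use. `build_params` is a hypothetical helper.

```python
from typing import Any, Dict, Optional

# Defaults taken from the loader example in this page.
DEFAULT_PARAMS: Dict[str, Any] = {"return_format": "markdown", "limit": 50}


def build_params(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge overrides onto the defaults; a value of None removes the key."""
    merged = {**DEFAULT_PARAMS, **(overrides or {})}
    return {key: value for key, value in merged.items() if value is not None}


crawl_params = build_params({
    "limit": 200,
    "readability": True,   # assumed parameter name, verify before use
    "request": "chrome",   # assumed parameter name and value, verify before use
})

# Usage: SpiderLoader(url="https://docs.example.com", mode="crawl", params=crawl_params)
```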

Citations

Source attribution

Every document includes its URL, crawl timestamp, and page metadata for citation grounding in RAG responses.
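
The metadata can be turned into a numbered source list to append to a generated answer. This is a sketch only: `format_sources` is a hypothetical helper, and the `url` and `title` metadata keys are assumed from the description above. It falls back to the URL when a page has no title.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def format_sources(docs: Iterable[Any]) -> str:
    """Render one numbered line per document, using its title and URL."""
    lines = []
    for number, doc in enumerate(docs, start=1):
        meta = getattr(doc, "metadata", None) or {}
        url = meta.get("url", "unknown source")   # assumed key
        title = meta.get("title") or url          # assumed key, falls back to the URL
        lines.append(f"[{number}] {title} ({url})")
    return "\n".join(lines)


# Stand-ins with the same attributes as LangChain Documents.
retrieved = [
    SimpleNamespace(
        page_content="...",
        metadata={"url": "https://docs.example.com/a", "title": "Getting started"},
    ),
    SimpleNamespace(page_content="...", metadata={"url": "https://docs.example.com/b"}),
]
sources = format_sources(retrieved)
```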

03 · Search + live RAG

Answer questions with real-time sources.

Combine Spider's Search API with LangChain to ground answers in live web data. One call returns search results plus page content as markdown.

Search + ChatOpenAI Python
from spider import Spider
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

spider = Spider()  # reads SPIDER_API_KEY from the environment
llm = ChatOpenAI(model="gpt-4o")  # reads OPENAI_API_KEY from the environment

# Search the web and get content in one call
results = spider.search(
    "latest changes to GDPR enforcement",
    params={
        "search_limit": 5,
        "fetch_page_content": True,
        "return_format": "markdown",
    }
)

context = "\n---\n".join(
    [f"[{r['url']}]\n{r['content'][:3000]}" for r in results if r.get("content")]
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using the sources. Cite URLs."),
    ("user", "Sources:\n{context}\n\nQuestion: {question}"),
])

chain = prompt | llm
answer = chain.invoke({"context": context, "question": "What changed in GDPR enforcement?"})
print(answer.content)
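
The context string above truncates each source to 3,000 characters but places no cap on the total. A variant with an overall budget might look like the sketch below. `build_context` is a hypothetical helper, the 12,000-character total is an arbitrary assumption, and results are assumed to be dicts with `url` and `content` keys as in the example above.

```python
from typing import Any, Dict, Iterable


def build_context(
    results: Iterable[Dict[str, Any]],
    per_source_chars: int = 3000,
    total_chars: int = 12000,
) -> str:
    """Join sources as "[url]\\ncontent" blocks, stopping before the total budget is exceeded."""
    blocks = []
    used = 0
    for result in results:
        content = result.get("content")
        if not content:
            continue  # skip results that came back without page content
        block = f"[{result.get('url', 'unknown')}]\n{content[:per_source_chars]}"
        # The budget counts block text only; separators are not included.
        if used + len(block) > total_chars:
            break
        blocks.append(block)
        used += len(block)
    return "\n---\n".join(blocks)


# Synthetic results standing in for a search response.
fake_results = [
    {"url": "https://a.example", "content": "x" * 10},
    {"url": "https://b.example", "content": ""},
    {"url": "https://c.example", "content": "y" * 10},
]
context_all = build_context(fake_results, per_source_chars=5, total_chars=1000)
context_capped = build_context(fake_results, per_source_chars=5, total_chars=30)
```

With real search results, `build_context(results)` would replace the list comprehension and its return value would be passed as `context` to the chain.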
Start

Build a LangChain pipeline on live web data.

Free balance on sign-up. No subscription required.

pip install spider-client langchain-community