Real-Time Web Search for RAG: Stop Feeding Your LLM Stale Data

Static document stores go stale within days. Here's how to add live web search to your RAG pipeline so your LLM always answers with current information. Complete implementations in Python with LangChain and vanilla code.

Jeff Mendez · 8 min read

There’s a specific failure mode in RAG pipelines that nobody talks about enough. You build the pipeline, demo it, everything works. Two weeks later someone asks a question about something that changed last Tuesday and the system confidently cites information that’s already wrong. Not hallucinated, just outdated. It was correct when you indexed it.

The standard RAG architecture has a built-in shelf life. You crawl documents, chunk them, embed them, and store them in a vector database. At query time, you retrieve the nearest chunks and feed them to the LLM. The problem is that step one happened days or weeks ago. Everything between index runs is a blind spot.

Re-indexing more frequently helps but doesn’t fix it. You can’t predict which pages will change or when. And for questions about current events, there’s no pre-indexed document to retrieve at all.

The real fix is adding a second retrieval path: live web search.

The freshness gap nobody plans for

Here’s a scenario that happens constantly. You build a RAG system over your company’s documentation. A customer asks about a new feature that shipped three days ago. The docs were updated the same day. But your weekly re-index hasn’t run yet. The RAG pipeline retrieves the old version of the page, and the LLM tells the customer the feature doesn’t exist.

Or worse: a regulation changes, a competitor updates their pricing, a security vulnerability gets disclosed. Your pipeline keeps serving the previous version because it literally doesn’t know anything newer exists.

The usual response is to index more often. Hourly, every fifteen minutes, continuously. But this gets expensive fast, both in compute and in API costs, and it still doesn’t solve the problem for questions that span the entire web. Your index covers your sources. The user’s question might be about anything.

Two retrieval paths, one answer

The architecture that actually works combines your existing vector database with a live web search at query time:

User Question
     |
     ├──> Vector DB (your indexed docs)  ──> Top-K chunks
     |
     └──> Live Web Search  ──> Fresh scraped content
                                    |
                               Combined Context
                                    |
                                   LLM
                                    |
                                 Answer

The vector DB handles questions your corpus covers well. The web search handles everything else. The LLM sees both and can synthesize a grounded answer that’s current.

Some people call this “Search-Augmented Generation.” The name doesn’t matter. What matters is that your LLM stops confidently stating things that were true last month.

The simple version: web search only

If you don’t have a vector database yet, or you’re building a general-purpose assistant that needs to answer questions about anything on the web, start here. This is the minimal viable search-augmented pipeline.

from spider import Spider
from openai import OpenAI
import os

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))
llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def search_and_answer(question: str, num_sources: int = 5) -> dict:
    # Search the web and scrape results in one API call
    results = spider.search(
        question,
        params={
            "search_limit": num_sources,
            "fetch_page_content": True,
            "return_format": "markdown",
            "readability": True,
            "tbs": "qdr:m",  # prefer results from the past month
        },
    )

    # Build context with source attribution
    sources = []
    context_parts = []
    for r in results:
        content = r.get("content", "")
        if not content:
            continue
        url = r["url"]
        sources.append(url)
        context_parts.append(f"[Source: {url}]\n{content[:4000]}")

    context = "\n\n---\n\n".join(context_parts)

    response = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question using only the provided web "
                    "sources. Cite each claim with the source URL in brackets. "
                    "If the sources don't fully cover the question, say what "
                    "you can and note what's missing."
                ),
            },
            {
                "role": "user",
                "content": f"## Sources\n\n{context}\n\n## Question\n\n{question}",
            },
        ],
        temperature=0.1,
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
    }

That’s a complete search-augmented pipeline in about 40 lines. The Spider search call typically takes 1-2 seconds (it searches and scrapes all results in parallel), and the LLM generates the answer in another 1-3 seconds. Total: 3-5 seconds for a cited, grounded response.

A few things to notice. The readability flag strips navigation bars, cookie banners, and sidebar junk from the scraped pages. Without it, you waste context tokens on noise. The tbs: "qdr:m" filter restricts results to the past month, which helps when the question is about recent information but doesn’t hurt when it isn’t.

For production systems with an existing knowledge base, the hybrid approach is more powerful. You get the specificity of your curated index plus the freshness of live web results.

from spider import Spider
from openai import OpenAI
import chromadb
import os

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("knowledge_base")


def embed(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return response.data[0].embedding


def hybrid_rag(question: str) -> dict:
    # Path 1: Retrieve from your indexed knowledge base
    db_results = collection.query(
        query_embeddings=[embed(question)], n_results=3
    )
    indexed_chunks = []
    for doc, meta in zip(db_results["documents"][0], db_results["metadatas"][0]):
        indexed_chunks.append(f"[Indexed: {meta.get('url', 'internal')}]\n{doc}")

    # Path 2: Search the live web
    web_results = spider.search(
        question,
        params={
            "search_limit": 3,
            "fetch_page_content": True,
            "return_format": "markdown",
            "readability": True,
        },
    )
    web_chunks = []
    for r in web_results:
        content = r.get("content", "")
        if content:
            web_chunks.append(f"[Live web: {r['url']}]\n{content[:4000]}")

    # Combine both
    context = "\n\n---\n\n".join(indexed_chunks + web_chunks)

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You have two types of sources: pre-indexed documents "
                    "and live web results. Prefer live web for time-sensitive "
                    "info. Prefer indexed sources for established facts. "
                    "Cite each source. Flag any contradictions."
                ),
            },
            {
                "role": "user",
                "content": f"## Context\n\n{context}\n\n## Question\n\n{question}",
            },
        ],
        temperature=0.1,
    )

    return {"answer": response.choices[0].message.content}

The interesting design choice here is running both retrieval paths for every query. You could add routing logic to skip the web search for questions your index handles well. In practice, the extra 1-2 seconds of latency is worth the insurance. We’ve seen cases where the web search catches an updated doc version that the index missed, and the LLM correctly synthesizes the newer information. That’s the whole point.
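If you do decide to add that routing step, one lightweight approach is to gate the web search on how close the vector DB's best match is, plus a recency check on the question itself. A minimal sketch of that heuristic; the distance threshold and keyword list are illustrative assumptions, not part of the pipeline above:

```python
def should_search_web(question: str, db_distances: list[float],
                      distance_threshold: float = 0.35) -> bool:
    """Decide whether to run the live web search for this query.

    Heuristic: skip the web only when the vector DB returned at least
    one very close match AND the question carries no recency cues.
    Both the threshold and the cue list are illustrative guesses to tune.
    """
    recency_cues = ("today", "latest", "current", "this week", "recent")
    if any(cue in question.lower() for cue in recency_cues):
        return True  # time-sensitive: always check the live web
    if not db_distances:
        return True  # empty index: nothing to answer from locally
    # Chroma reports smaller distances for closer matches
    return min(db_distances) > distance_threshold
```

Even with routing in place, it's worth defaulting to True on any ambiguity; the failure mode of an unnecessary search is a couple of seconds, while the failure mode of a skipped search is a stale answer.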

LangChain version

If you’re already in the LangChain ecosystem, here’s the same hybrid pattern using Spider’s LangChain integration:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from spider import Spider
import os

spider = Spider(api_key=os.getenv("SPIDER_API_KEY"))
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
vectorstore = Chroma(
    persist_directory="./chroma_langchain",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)


def hybrid_retrieve(question: str) -> str:
    # Vector store path
    indexed_docs = vectorstore.similarity_search(question, k=3)
    indexed_text = "\n\n".join(
        [f"[Indexed: {d.metadata.get('source', '?')}]\n{d.page_content}" for d in indexed_docs]
    )

    # Live web path
    results = spider.search(
        question,
        params={
            "search_limit": 3,
            "fetch_page_content": True,
            "return_format": "markdown",
            "readability": True,
        },
    )
    web_text = "\n\n".join(
        [f"[Live: {r['url']}]\n{r.get('content', '')[:3000]}" for r in results if r.get("content")]
    )

    return f"## Indexed Sources\n\n{indexed_text}\n\n## Live Web\n\n{web_text}"


prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using the provided sources. Cite URLs. Flag stale vs fresh info."),
    ("user", "## Sources\n\n{context}\n\n## Question\n\n{question}"),
])

chain = (
    {"context": lambda x: hybrid_retrieve(x["question"]), "question": lambda x: x["question"]}
    | prompt
    | llm
)

answer = chain.invoke({"question": "What changed in Google's search algorithm this month?"})
print(answer.content)

What this costs

Let’s be specific. Here’s the per-query breakdown for a hybrid RAG query with 3 indexed results + 3 live web results:

Component                                      Cost
Spider search + scrape (3 results)             ~$0.002
ChromaDB lookup (local)                        $0
Embedding the question                         ~$0.00001
GPT-4o generation (~2K input, ~500 output)     ~$0.01
Total per query                                ~$0.012

Spider is the cheap part. The LLM is where the cost lives. Compare this to running a separate SERP API ($0.01-0.05 per search) plus a separate scraping service ($0.005-0.02 per page): a single Spider call replaces one search request plus a scrape per result, which works out 3-5x cheaper.

Lessons from production

A few things we’ve learned from teams running this pattern at scale:

Keep search_limit between 3 and 5 for interactive use. More results means more tokens, more latency, and more cost. Five sources is almost always enough for a well-grounded answer. Save the deep 20-50 result searches for batch research jobs where latency doesn’t matter.

Truncate aggressively. A single web page can easily be 10,000+ tokens after markdown conversion. The useful information is usually in the first 2,000-3,000 tokens. Trim each source so you have room for the LLM to actually generate a response.
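The hard slice used in the examples above (content[:4000]) works, but it can chop a sentence mid-word. A slightly smarter cut backs up to the last paragraph break before the budget. A minimal sketch; characters are used as a cheap proxy for tokens, and the budget is the assumption discussed above:

```python
def truncate_source(content: str, max_chars: int = 3000) -> str:
    """Trim scraped markdown to a character budget, cutting at a paragraph break."""
    if len(content) <= max_chars:
        return content
    cut = content[:max_chars]
    # Back up to the last blank line so we don't end mid-sentence
    boundary = cut.rfind("\n\n")
    if boundary > max_chars // 2:  # only back up if enough text survives
        cut = cut[:boundary]
    return cut.rstrip() + "\n\n[truncated]"
```

The explicit [truncated] marker also tells the LLM the source was cut, which discourages it from treating an abrupt ending as the author's conclusion.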

Use time filters when the question implies recency. “What happened today” and “what’s the current policy” are different questions. A simple keyword check for time indicators (“today,” “this week,” “latest,” “current,” “recent”) can trigger the tbs parameter automatically.
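That keyword check can live in one small function that returns extra search params. A sketch, assuming Spider's tbs parameter accepts the Google-style qdr:d / qdr:w / qdr:m values used earlier; the cue-to-window mapping is an illustrative starting point:

```python
def recency_params(question: str) -> dict:
    """Map time cues in the question to a tbs filter, if any apply."""
    q = question.lower()
    if any(cue in q for cue in ("today", "breaking", "right now")):
        return {"tbs": "qdr:d"}   # past day
    if any(cue in q for cue in ("this week", "latest", "recent")):
        return {"tbs": "qdr:w"}   # past week
    if any(cue in q for cue in ("this month", "current")):
        return {"tbs": "qdr:m"}   # past month
    return {}  # no recency cue: don't restrict results

# Merge into the search params before calling spider.search
params = {"search_limit": 5, "fetch_page_content": True}
params.update(recency_params("What changed today?"))
```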

Stream when you can. Use JSONL content type with Spider to receive results as each page finishes scraping. Start feeding early results to the LLM while later ones are still being fetched. This cuts perceived latency by 1-2 seconds.

Log everything. Save the search query, the URLs returned, the content lengths, and the LLM’s answer. When a user complains about a bad answer, you need to know whether the problem was bad search results, bad content extraction, or bad LLM reasoning. They require very different fixes.
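A minimal shape for that per-query record, written as one JSON line so it's easy to grep later. The field names are illustrative, not a fixed schema; the idea is that content lengths of zero immediately point at extraction failures rather than LLM failures:

```python
import json
import time


def build_query_log(question: str, urls: list[str],
                    contents: list[str], answer: str) -> dict:
    """Assemble one structured log record per RAG query."""
    return {
        "ts": time.time(),
        "question": question,
        "result_urls": urls,
        "content_chars": [len(c) for c in contents],  # zeros flag bad extraction
        "answer_chars": len(answer),
    }


record = build_query_log("example?", ["https://a.example"], ["some text"], "an answer")
print(json.dumps(record))  # one JSON line per query
```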

When you need this, and when you don’t

Add live search when:

  • Users ask about current events or recently changed information
  • Your source documents update faster than you re-index
  • You need coverage beyond your indexed sources
  • Users expect cited answers with verifiable source URLs

Stick with static RAG when:

  • Your corpus is stable (internal docs, product manuals, legal contracts)
  • Freshness genuinely doesn’t matter (historical data, reference material)
  • You need sub-100ms retrieval latency (vector lookups are faster)
  • You’re operating offline or air-gapped

Use both when:

  • You have a core knowledge base but it can’t cover everything
  • You want to cross-check indexed info against current web sources
  • Your users ask a mix of domain-specific and general questions

Most production systems we see end up using both. The cost of running a web search on every query is small compared to the cost of serving a confidently wrong answer.

Try it

from spider import Spider

client = Spider()
results = client.search(
    "your question here",
    params={"search_limit": 5, "fetch_page_content": True, "return_format": "markdown"}
)

for r in results:
    print(r["url"], len(r.get("content", "")), "chars")

No subscription. Pay per request. Credits never expire.

Create an account | Search API docs | RAG pipeline tutorial
