Automated Cold Email Outreach Using Spider

Extract company info from inbound emails, scrape their website with Spider, and generate personalized replies with RAG.

7 min read William Espegren

Automate cold email responses by extracting the sender’s company from inbound emails, scraping their website with Spider, and generating a personalized reply using RAG.

Retrieve Email

This guide does not cover retrieving the email itself, since that varies between email services. Instead, the email content is stored in a variable.

email_content = '''
Thank you for your email, Gilbert,

I have looked into YourBusinessName, and it seems to suit some of our customers' requests, but not enough to make it profitable for us to invest time and money in integrating it into our current services. If you have any use cases in mind that suit our company, I might propose an idea to the others.

Best,
Matilda

SEO expert at Spider.cloud
'''
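
If your mail service hands you a raw RFC 5322 message rather than plain text, Python's standard library can pull out the body. The following is a minimal sketch under that assumption; `extract_plain_body` and the sample message are hypothetical names and data used only for illustration.

```python
from email import message_from_string
from email.policy import default


def extract_plain_body(raw_message: str) -> str:
    """Return the text/plain body of a raw message, or '' if there is none."""
    msg = message_from_string(raw_message, policy=default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""


# Hypothetical raw message, as an IMAP client or webhook might deliver it.
raw = (
    "From: Matilda <matilda@example.com>\n"
    "To: Gilbert <gilbert@example.com>\n"
    "Subject: Re: Introduction\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Thank you for your email, Gilbert,\n"
)

email_body = extract_plain_body(raw)
```

The returned string can then play the role of the email variable used in the rest of this guide.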

Set Up OpenAI

Get OpenAI running with the following steps:

  1. Create an account and get an API Key on OpenAI.

  2. Install OpenAI and set up the API key in your project as an environment variable. This approach prevents you from hardcoding the key in your code.

pip install openai

In your terminal:

export OPENAI_API_KEY=<your-api-key-here>

Alternatively, you can use the dotenv package to load the environment variables from a .env file. Create a .env file in your project root and add the following:

OPENAI_API_KEY=<your-api-key-here>

Then, in your Python code:

from dotenv import load_dotenv
from openai import OpenAI
import os

load_dotenv()

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

  3. Test OpenAI to see if things are working correctly:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

chat_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "What are large language models?",
        }
    ]
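)

# Print the reply to confirm that the key and client work
print(chat_completion.choices[0].message.content)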
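
Since the key is read from the environment, it can help to fail early with a message that names the missing variable. This is one possible sketch; `require_env` is a hypothetical helper, not part of the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (commented out so the sketch runs without a real key):
# client = OpenAI(api_key=require_env("OPENAI_API_KEY"))
```

The same helper could be used for the `SPIDER_API_KEY` variable read later in this guide.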

Set Up Spider & LangChain

After you get your Spider API key, expose it as the SPIDER_API_KEY environment variable. For the full API reference, see the Spider API Guide.

Install the Spider Python client library and LangChain:

pip install spider_client langchain langchain-community

Then import the SpiderLoader from the document loaders module:

from langchain_community.document_loaders import SpiderLoader

Set up the Spider loader:

def load_markdown_from_url(urls):
    loader = SpiderLoader(
        url=urls,
        mode="crawl",
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",
            "request_timeout": 60,
            "limit": 1,
        },
    )
    data = loader.load()
    return data

Set the mode to crawl and use the return_format parameter to specify we want markdown content. The rest of the parameters are optional.

Reminder

Spider handles concurrency and IP rotation automatically. Higher credit balances unlock higher concurrency limits.

For now, we’ll turn off the proxy and move on to setting up LangChain.

Puzzling the Pieces Together

Extract the company name from the email:

import os
from openai import OpenAI

email_content = '''
Thank you for your email, Gilbert,

I have looked into yourAutomatedCRM, and it seems to suit some of our customers' requests, but not enough to make it profitable for us to invest time and money in integrating it into our current services. If you have any use cases in mind that suit our company, I might be able to propose an idea to the others.

Best,
Matilda

SEO expert at Spider.cloud
'''

# Initialize OpenAI client
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

def extract_company_name(email):
    # Define messages
    # Use the function argument rather than the global variable
    messages = [{"role": "user", "content": f'Extract the company name and return ONLY the company name from the sender of this email: """{email}"""'}]

    # Call OpenAI API
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

    return completion.choices[0].message.content

company_name = extract_company_name(email_content)
print(company_name)
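
The model is asked to return only the company name, but replies sometimes arrive wrapped in quotes or prefixed with a label. A small normalization step, sketched below, can make the value safer to reuse in later prompts. `clean_company_name` is a hypothetical helper, and the label check is an assumption about what such replies might look like.

```python
def clean_company_name(raw: str) -> str:
    """Normalize a model reply that should contain only a company name."""
    # Keep the first non-empty line in case the model adds commentary.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        return ""
    name = lines[0]
    # Drop a leading label such as "Company:" if the model included one.
    label, sep, rest = name.partition(":")
    if sep and label.strip().lower() in {"company", "company name"} and rest.strip():
        name = rest.strip()
    # Remove wrapping whitespace, quotes, and backticks.
    return name.strip(" \t\"'`")


cleaned = clean_company_name('  "Spider.cloud"\n')
```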

Finding the Company’s Official Website

Use Spider’s gpt_config to extract the company’s website URL from a Bing search:

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {
    "limit":1,
    "gpt_config":{
        "prompt":f'Return the official website of the company for {company_name}',
        "model":"gpt-4o-mini",
        "max_tokens":4096,
        "temperature":0.54,
        "top_p":0.17,
        "api_key": None
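    },
    # URL-encode the extracted name so the search targets the sender's company
    "url": f"https://www.bing.com/search?q={requests.utils.quote(company_name)}"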
}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, 
  json=json_data
)

company_url = response.json()[0]['metadata']['extracted_data']
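
The extracted value comes from a language model, so it may be a full sentence or a JSON-like string rather than a bare URL. The sketch below assumes the value is (or has been converted to) a string and picks out the first http(s) URL before it is passed to the loader. `first_http_url` and its regular expression are illustrative, not part of the Spider API.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Match http(s) URLs up to whitespace, quotes, or closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def first_http_url(text: str) -> Optional[str]:
    """Return the first http(s) URL found in text, or None."""
    match = URL_PATTERN.search(text or "")
    if match is None:
        return None
    # Trim punctuation that commonly trails a URL at the end of a sentence.
    candidate = match.group(0).rstrip(".,;")
    parsed = urlparse(candidate)
    # Require a hostname containing a dot, e.g. "spider.cloud".
    if parsed.hostname and "." in parsed.hostname:
        return candidate
    return None


checked_url = first_http_url("The official website is https://spider.cloud.")
```

If the helper returns None, it is probably better to stop or fall back to a manual lookup than to crawl an unverified address.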

gpt_config parameters

  • prompt: The prompt provided to the model (string or a list of strings)
  • model: The specific GPT model to use.
  • max_tokens: The maximum number of tokens to generate.
  • temperature: Controls the randomness of the output (higher values make output more random).
  • top_p: Controls the diversity of the output (higher values make output more diverse).
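
To make temperature and top_p more concrete, here is a toy sketch of the standard definitions of temperature-scaled softmax and nucleus (top-p) filtering. It illustrates the general technique only; it is not Spider's or OpenAI's internal code, and the logits are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose total probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for index, p in ranked:
        kept.append(index)
        mass += p
        if mass >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.54)  # temperature from the example
flat = softmax_with_temperature(logits, 1.5)
nucleus = top_p_filter(sharp, 0.17)  # top_p from the example
```

In this toy case the low top_p leaves only the most likely token in the candidate set, which suggests why a small top_p tends to make extraction-style answers more repeatable.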

Crafting a Personalized Email Based on the Website’s Content

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import SpiderLoader

company_url = 'https://spider.cloud'

def filter_metadata(doc):
    # Filter out or replace None values in metadata
    doc.metadata = {k: (v if v is not None else "") for k, v in doc.metadata.items()}
    return doc

def load_markdown_from_url(urls):
    loader = SpiderLoader(
        # api_key="your-api-key-here", # if no API key is provided it looks for SPIDER_API_KEY in env
        url=urls,
        mode="crawl",  
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",  
            "request_timeout": 60,
            "limit": 1,
        },
    )
    data = loader.load()
    return data

docs = load_markdown_from_url(company_url)

# Filter metadata in documents
docs = [filter_metadata(doc) for doc in docs]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the website.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

llm = ChatOpenAI(model="gpt-4o")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke(f'Craft a super personalized email answering {company_name}, addressing their response to our cold outreach campaign. Their email: """{email_content}"""'))

Example output:

Hi Matilda,

Thank you for considering AutomatedCRM. Given Spider.cloud's needs for efficient and large-scale data collection, our CRM can integrate seamlessly with tools like Spider, providing robust, high-speed data extraction and management. I'd love to discuss specific use cases where this integration could significantly enhance your current offerings.

Best,
Gilbert
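
The chunk_size=1000 and chunk_overlap=200 settings in the code above mean that neighboring chunks share some text, so a sentence cut at a boundary still appears whole in one of them. RecursiveCharacterTextSplitter prefers to split on separators such as paragraph breaks; the simplified fixed-window sketch below ignores separators and only shows the overlap idea. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample_text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample_text)
```

With these defaults, each window starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next.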

Why hub.pull("rlm/rag-prompt")?

This LangChain Hub prompt template is designed for RAG tasks. It takes retrieved context and a question, then generates a grounded response using only the provided data.
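
To show the mechanics without a network call, the sketch below fills a stand-in template with retrieved chunks and a question. The template wording is illustrative and is not the text stored on LangChain Hub; `RAG_TEMPLATE` and `build_rag_prompt` are hypothetical names.

```python
# Stand-in template; the wording is illustrative, not the hub prompt's text.
RAG_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_rag_prompt(chunks, question):
    """Join retrieved chunks and fill both template variables."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(context=context, question=question)


example_prompt = build_rag_prompt(
    ["Spider is a web crawling API.", "It returns markdown."],
    "What does the company offer?",
)
```

In the chain above, the retriever and format_docs supply the context, and RunnablePassthrough forwards the question unchanged.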

Complete Code

Full code:

import requests, os
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import SpiderLoader

email_content = '''
Thank you for your email, Gilbert,

I have looked into yourAutomatedCRM, and it seems to suit some of our customers' requests, but not enough to make it profitable for us to invest time and money in integrating it into our current services. If you have any use cases in mind that suit our company, I might be able to propose an idea to the others.

Best,
Matilda

SEO expert at Spider.cloud
'''

# Initialize OpenAI client
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

def extract_company_name(email):
    # Define messages
    # Use the function argument rather than the global variable
    messages = [{"role": "user", "content": f'Extract the company name and return ONLY the company name from the sender of this email: """{email}"""'}]

    # Call OpenAI API
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

    return completion.choices[0].message.content

company_name = extract_company_name(email_content)

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {
    "limit":1,
    "gpt_config":{
        "prompt":f'Return the official website of the company for {company_name}',
        "model":"gpt-4o-mini",
        "max_tokens":4096,
        "temperature":0.54,
        "top_p":0.17,
        "api_key": None
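    },
    # URL-encode the extracted name so the search targets the sender's company
    "url": f"https://www.bing.com/search?q={requests.utils.quote(company_name)}"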
}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, 
  json=json_data
)

company_url = response.json()[0]['metadata']['extracted_data']

def filter_metadata(doc):
    # Filter out or replace None values in metadata
    doc.metadata = {k: (v if v is not None else "") for k, v in doc.metadata.items()}
    return doc

def load_markdown_from_url(urls):
    loader = SpiderLoader(
        # api_key="your-api-key-here", # if no API key is provided it looks for SPIDER_API_KEY in env
        url=urls,
        mode="crawl",  
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",  
            "request_timeout": 60,
            "limit": 1,
        },
    )
    data = loader.load()
    return data

docs = load_markdown_from_url(company_url)

# Filter metadata in documents
docs = [filter_metadata(doc) for doc in docs]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the website.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

llm = ChatOpenAI(model="gpt-4o")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke(f'Craft a super personalized email addressing {company_name}, responding to their message as part of our cold outreach campaign. Their email: """{email_content}"""'))
