5/17/2024 • 8 min read

Guide - Automated Cold Email Outreach Using Spider

This guide demonstrates how to automate cold email outreach. You will learn how to take the content of an email reply, identify the company behind it, find that company's website, and craft a personalized email using the LLM-ready data that Spider returns from the website.

Retrieve Email
Setup OpenAI
Setup Spider & Langchain
- Reminder
Puzzling the Pieces Together
- Finding the Company’s Official Website
Explanation of the gpt_config
- Crafting a Personalized Email Based on the Website’s Content
Why do we use hub.pull("rlm/rag-prompt")?
Complete Code

Retrieve Email

For this guide, we will not cover how to get the contents of the email, as it varies between different services. Instead, we will use a variable with the email content.

email = '''
Thank you for your email, Gilbert,

I have looked into YourBusinessName, and it seems to suit some of our customers' requests, but not enough to make it profitable for us to invest time and money in integrating it into our current services. If you have any use cases in mind that suit our company, I might propose an idea to the others.

Best,
Matilda

SEO expert at Spider.cloud
'''

Setup OpenAI

Get OpenAI set up and running in a few minutes with the following steps:

Create an account and get an API Key on OpenAI.
Install OpenAI and set up the API key in your project as an environment variable. This approach prevents you from hardcoding the key in your code.

pip install openai

In your terminal:

export OPENAI_API_KEY=<your-api-key-here>

Alternatively, you can use the python-dotenv package (pip install python-dotenv) to load the environment variables from a .env file. Create a .env file in your project root and add the following:

OPENAI_API_KEY=<your-api-key-here>

Then, in your Python code:

from dotenv import load_dotenv
from openai import OpenAI
import os

load_dotenv()

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)
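A missing key usually only surfaces later as an authentication error, so it can help to fail early with a clear message. The following is a minimal sketch; `require_env` is a hypothetical helper written for this guide, not part of the OpenAI SDK or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for this guide, not an OpenAI or dotenv API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to your .env file."
        )
    return value
```

With this in place, the client could be created as `OpenAI(api_key=require_env("OPENAI_API_KEY"))` after `load_dotenv()` has run.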

Test OpenAI to see if things are working correctly:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

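chat_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "What are large language models?",
        }
    ]
)

# Print the reply to confirm that the key and client work
print(chat_completion.choices[0].message.content)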

Setup Spider & Langchain

Getting started with the API is simple and straightforward. After you get your secret key, you can use the Spider LangChain document loader. We won’t rehash the full setup guide for Spider here, but if you want to use the API directly, you can check out the Spider API Guide to learn more.

Install the Spider Python client library and the LangChain packages used in this guide:

pip install spider_client langchain langchain-community langchain-openai langchain-chroma langchain-text-splitters

Then import the SpiderLoader from the document loaders module:

from langchain_community.document_loaders import SpiderLoader

Let’s set up the Spider API for our example use case:

def load_markdown_from_url(urls):
    loader = SpiderLoader(
        url=urls,
        mode="crawl",
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",
            "request_timeout": 60,
            "limit": 1,
        },
    )
    data = loader.load()
    return data

Set the mode to crawl and use the return_format parameter to specify we want markdown content. The rest of the parameters are optional.
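To keep the required setting separate from the optional ones, the params dictionary can be built by a small helper. This is only a sketch: `spider_params` is a hypothetical function, and the parameter names are copied from the example above rather than checked against the full Spider API.

```python
def spider_params(limit: int = 1, **overrides) -> dict:
    """Build the params dict passed to SpiderLoader.

    Hypothetical helper: only return_format is fixed by default, and the
    limit defaults to a single page so a test run spends few credits.
    """
    params = {"return_format": "markdown", "limit": limit}
    # Optional settings such as request="http" or proxy_enabled=False
    # are passed through unchanged.
    params.update(overrides)
    return params
```

For example, `SpiderLoader(url=urls, mode="crawl", params=spider_params(request="http"))` would request markdown for a single page over plain HTTP.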

Reminder

Spider handles automatic concurrency and IP rotation to make it simple to scrape multiple URLs at once. The more credits you have, the higher your concurrency limit. Make sure you have enough credits if you choose to crawl more than one page.

For now, we’ll turn off the proxy and move on to setting up LangChain.

Puzzling the Pieces Together

Now that we have everything installed and working, we can start connecting the different pieces together.

First, we need to extract the company name from the email:

import os
from openai import OpenAI

email_content = '''
Thank you for your email, Gilbert,

I have looked into yourAutomatedCRM, and it seems to suit some of our customers' requests, but not enough to make it profitable for us to invest time and money in integrating it into our current services. If you have any use cases in mind that suit our company, I might be able to propose an idea to the others.

Best,
Matilda

SEO expert at Spider.cloud
'''

# Initialize OpenAI client
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

def extract_company_name(email):
    # Build the prompt from the function argument rather than the global variable
    messages = [{"role": "user", "content": f'Extract the company name and return ONLY the company name from the sender of this email: """{email}"""'}]

    # Call OpenAI API
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

    return completion.choices[0].message.content

company_name = extract_company_name(email_content)
print(company_name)
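Even when asked to return ONLY the company name, a model may wrap the name in quotes or add a trailing period. A small cleanup step, sketched here with a hypothetical helper `clean_company_name`, makes the value safer to reuse in later requests.

```python
def clean_company_name(raw: str) -> str:
    """Normalize a model reply that should contain only a company name.

    Hypothetical helper for this guide. Note that it also drops a final
    period, so "Acme Inc." becomes "Acme Inc".
    """
    # Keep the first non-empty line in case the model adds commentary.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        raise ValueError("The model returned an empty company name.")
    # Strip surrounding quotes, whitespace, and sentence punctuation.
    return lines[0].strip(" \t\"'`.,;:")
```

It could be applied as `company_name = clean_company_name(extract_company_name(email_content))`.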

Finding the Company’s Official Website

Spider’s built-in AI scraping tools let us include our own prompt in the Spider API request.

For this guide, we run the prompt “Return the official website of the company for company-name” against a Bing search results page, since we want the URL of the company’s website.

import requests, os
from urllib.parse import quote_plus

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {
    "limit":1,
    "gpt_config":{
        "prompt":f'Return the official website of the company for {company_name}',
        "model":"gpt-4o-mini",
        "max_tokens":4096,
        "temperature":0.54,
        "top_p":0.17,
        "api_key": None
    },
    # Search Bing for the extracted company name (URL-encoded) instead of a fixed query
    "url":f"https://www.bing.com/search?q={quote_plus(company_name)}"
}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, 
  json=json_data
)

company_url = response.json()[0]['metadata']['extracted_data']
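The model's answer may be a full sentence rather than a bare URL, and a failed request will not have the shape indexed above. The sketch below pulls the first URL out of the response defensively. `extract_company_url` is a hypothetical helper, and it assumes the response layout used in this guide's example (`[0]['metadata']['extracted_data']`), which may differ in practice.

```python
import re

# Matches http(s) URLs up to the first whitespace, quote, or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def extract_company_url(payload) -> str:
    """Return the first URL found in a Spider response.

    Assumes payload[0]["metadata"]["extracted_data"] holds the model's
    answer, as in this guide's example; the real field may differ.
    """
    try:
        extracted = payload[0]["metadata"]["extracted_data"]
    except (IndexError, KeyError, TypeError) as exc:
        raise ValueError("Unexpected response shape from Spider.") from exc
    match = URL_PATTERN.search(str(extracted))
    if match is None:
        raise ValueError(f"No URL found in extracted data: {extracted!r}")
    # Drop punctuation that belongs to the sentence, not the URL.
    return match.group(0).rstrip(".,")
```

It would be called as `company_url = extract_company_url(response.json())`, ideally after checking the HTTP status with `response.raise_for_status()`.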

Explanation of the `gpt_config`

The gpt_config in the Spider API request specifies the configuration for the GPT model used to process the scraped data. It includes parameters such as:

prompt: The prompt provided to the model (string or a list of strings)
model: The specific GPT model to use.
max_tokens: The maximum number of tokens to generate.
temperature: Controls the randomness of the output (higher values make output more random).
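top_p: Sets the nucleus sampling threshold: the model samples only from the most likely tokens whose cumulative probability reaches top_p (higher values make output more diverse).

These settings shape how the model processes the scraped data. The low top_p used here keeps the answer focused, which suits extracting a single URL.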
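The fields listed above can be assembled and sanity-checked before the request is sent. This is a sketch under assumptions: `build_gpt_config` is a hypothetical helper, the defaults mirror this guide's example, and the ranges checked are the ones OpenAI documents for its chat models (temperature from 0 to 2, top_p from 0 to 1). Spider may validate these values differently.

```python
def build_gpt_config(prompt, model="gpt-4o-mini", max_tokens=4096,
                     temperature=0.54, top_p=0.17, api_key=None):
    """Assemble a gpt_config dict with the fields used in this guide.

    Hypothetical helper; the range checks follow OpenAI's documented
    limits and are not taken from the Spider API.
    """
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return {
        "prompt": prompt,
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        # Passed through unchanged, as in the guide's example request.
        "api_key": api_key,
    }
```

The result can be placed under the "gpt_config" key of the JSON body shown earlier.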

Crafting a Personalized Email Based on the Website’s Content

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import SpiderLoader

company_url = 'https://spider.cloud'

def filter_metadata(doc):
    # Filter out or replace None values in metadata
    doc.metadata = {k: (v if v is not None else "") for k, v in doc.metadata.items()}
    return doc

def load_markdown_from_url(urls):
    loader = SpiderLoader(
        # api_key="your-api-key-here",  # if omitted, SpiderLoader reads SPIDER_API_KEY from the environment
        url=urls,
        mode="crawl",  
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",  
            "request_timeout": 60,
            "limit": 1,
        },
    )
    data = loader.load()
    return data

docs = load_markdown_from_url(company_url)

# Filter metadata in documents
docs = [filter_metadata(doc) for doc in docs]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the website.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

llm = ChatOpenAI(model="gpt-4o")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke(f'Craft a super personalized email answering {company_name}, addressing their response to our cold outreach campaign. Their email: """{email_content}"""'))

And the results should be something like this:

### output

Hi Matilda,

Thank you for considering AutomatedCRM. Given Spider.cloud's needs for efficient and large-scale data collection, our CRM can integrate seamlessly with tools like Spider, providing robust, high-speed data extraction and management. I'd love to discuss specific use cases where this integration could significantly enhance your current offerings.

Best,
Gilbert

Why do we use `hub.pull("rlm/rag-prompt")`?

We chose hub.pull("rlm/rag-prompt") because it pulls a ready-made prompt template from the LangChain Hub that is designed for retrieval-augmented generation (RAG) tasks. The template takes a context and a question, which our chain fills with the website chunks retrieved from the vector store and with our instruction, respectively. This helps the model create contextually relevant, personalized responses grounded in the data returned from Spider.
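To make the idea concrete, here is a plain-Python sketch of what a RAG prompt template does. The wording of `RAG_TEMPLATE` is illustrative and is not the actual text of the hub prompt; `fill_rag_prompt` is a hypothetical helper that plays the role of the `format_docs` and prompt steps in the chain.

```python
# Illustrative template only; the hub prompt's real wording may differ.
RAG_TEMPLATE = (
    "Use the following retrieved context to answer the question.\n"
    "If the context does not contain the answer, say so.\n\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def fill_rag_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks and place them in the template with the question."""
    # Same joining rule as format_docs in the guide: blank line between chunks.
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(question=question, context=context)
```

In the real chain, the retriever supplies the chunks and the filled prompt is sent to the chat model. A custom template like this one could be swapped in if the hub prompt's instructions do not fit the email use case.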

Complete Code

That’s it! We now have a fully automated cold email outreach system with Spider that responds with personalized emails based on the company’s website.

Here is the full code:

import requests, os
from urllib.parse import quote_plus
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import SpiderLoader

email_content = '''
Thank you for your email, Gilbert,

I have looked into yourAutomatedCRM, and it seems to suit some of our customers' requests, but not enough to make it profitable for us to invest time and money in integrating it into our current services. If you have any use cases in mind that suit our company, I might be able to propose an idea to the others.

Best,
Matilda

SEO expert at Spider.cloud
'''

# Initialize OpenAI client
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

def extract_company_name(email):
    # Build the prompt from the function argument rather than the global variable
    messages = [{"role": "user", "content": f'Extract the company name and return ONLY the company name from the sender of this email: """{email}"""'}]

    # Call OpenAI API
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

    return completion.choices[0].message.content

company_name = extract_company_name(email_content)

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

json_data = {
    "limit":1,
    "gpt_config":{
        "prompt":f'Return the official website of the company for {company_name}',
        "model":"gpt-4o-mini",
        "max_tokens":4096,
        "temperature":0.54,
        "top_p":0.17,
        "api_key": None
    },
    # Search Bing for the extracted company name (URL-encoded) instead of a fixed query
    "url":f"https://www.bing.com/search?q={quote_plus(company_name)}"
}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, 
  json=json_data
)

company_url = response.json()[0]['metadata']['extracted_data']

def filter_metadata(doc):
    # Filter out or replace None values in metadata
    doc.metadata = {k: (v if v is not None else "") for k, v in doc.metadata.items()}
    return doc

def load_markdown_from_url(urls):
    loader = SpiderLoader(
        # api_key="your-api-key-here",  # if omitted, SpiderLoader reads SPIDER_API_KEY from the environment
        url=urls,
        mode="crawl",  
        params={
            "return_format": "markdown",
            "proxy_enabled": False,
            "request": "http",  
            "request_timeout": 60,
            "limit": 1,
        },
    )
    data = loader.load()
    return data

docs = load_markdown_from_url(company_url)

# Filter metadata in documents
docs = [filter_metadata(doc) for doc in docs]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the website.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

llm = ChatOpenAI(model="gpt-4o")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke(f'Craft a super personalized email addressing {company_name}, responding to their message as part of our cold outreach campaign. Their email: """{email_content}"""'))

If you liked this guide, consider following Spider and me on Twitter:

Author Twitter: WilliamEspegren
Spider Twitter: spider_rust
