
Crawling Authenticated Pages with Spider.cloud


Prerequisites

Before you start, you will need a Spider.cloud account with an API key and a recent version of Python installed.

Setting Up Your Environment

Install the Spider Python client:

pip install spider-client
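It's good practice to keep your API key out of source code. Here is a minimal sketch, assuming the key is stored in the SPIDER_API_KEY environment variable (the same variable the cookie example below reads):

from spider import Spider
import os

# Read the API key from the environment rather than hard-coding it
api_key = os.getenv("SPIDER_API_KEY")

# Initialize the Spider client with your API key
app = Spider(api_key=api_key)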

Using the Spider.cloud API

One of the simplest ways to handle authenticated pages is by setting the cookie directly in your request.

Example

from spider import Spider
import os

# Initialize the Spider client with your API key
app = Spider(api_key=os.getenv("SPIDER_API_KEY"))

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    # Replace with the session cookie from an authenticated browser session
    'Cookie': 'YOUR_AUTH_COOKIE',
}

json_data = {"url": "http://www.example.com", "return_format": "markdown"}
response = app.crawl_url(json_data['url'], params=json_data, headers=headers)
print(response)

Using execution_scripts to Run Custom JavaScript

The execution_scripts parameter allows you to run custom JavaScript code on the page. This can be particularly useful for actions like logging in through a form.

Example

First, let’s create a script that handles the login process:

document.querySelector('input[name="username"]').value = 'your_username';
document.querySelector('input[name="password"]').value = 'your_password';
document.querySelector('form').submit();

Now incorporate this into the request, wrapping it so it runs once the page has loaded:

from spider import Spider

# Initialize the Spider with your API key
app = Spider(api_key='your_api_key')

execution_scripts = {
    "http://www.example.com/login": """
    document.addEventListener("DOMContentLoaded", function() {
        document.querySelector('input[name="username"]').value = "your_username";
        document.querySelector('input[name="password"]').value = "your_password";
        document.querySelector("form").submit();
    });
    """
}

json_data = {
    "return_cookies": True,
    "return_headers": True,
    "request": "chrome",
    "execution_scripts": execution_scripts,
    "return_format": "markdown",
    "url": "http://www.example.com/login"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)

Using GPT Config for AI-Driven Actions

You can use the gpt_config option to run custom AI actions that interact with the page, filling in forms or navigating through login sequences using models like GPT-4. If your account does not have at least $50 in credits, you will need to provide your own OpenAI API key.

from spider import Spider

# Initialize the Spider with your API key
app = Spider(api_key='your_api_key')

gpt_config = {
    "prompt": "extract the main article",
    "model": "gpt-4",
    "max_tokens": 2000,
    "temperature": 0.54,
    "top_p": 0.17,
    "api_key": None  # your_openai_api_key is only necessary if your credits are under $50
}

json_data = {
    "url": "http://www.example.com/authenticated-page",
    "gpt_config": gpt_config,
    "return_format": "markdown"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)

GPT Configs Structure

GPT configs in JSON format for your reference:

{
	"prompt": "extract the main article",
	"model": "gpt-4",
	"max_tokens": 2000,
	"temperature": 0.54,
	"user": null,
	"top_p": 0.17,
	"prompt_url_map": null,
	"extra_ai_data": true,
	"paths_map": true,
	"screenshot": false,
	"api_key": "your_openai_api_key",
	"cache": null
}

Additional Parameters

Spider.cloud provides additional parameters to control the behavior of your crawl, such as the request mode and the return format used in the examples above. For a full list of parameters, refer to the Spider.cloud API Reference.
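As a rough illustration, the sketch below combines the cookie approach with a couple of extra request options; the limit parameter (a cap on the number of pages crawled) is an assumption here, so check the API reference for the exact names and behavior:

from spider import Spider

# Initialize the Spider with your API key
app = Spider(api_key='your_api_key')

json_data = {
    "url": "http://www.example.com",
    "request": "chrome",          # render pages with a headless browser, as in the examples above
    "return_format": "markdown",
    "limit": 5,                   # assumed parameter: cap the number of pages crawled
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)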

Conclusion

Using the Spider.cloud API, you can effectively handle authenticated pages by setting cookies directly, running custom JavaScript, or using AI-driven actions. With these methods, you can ensure your crawling tasks access the necessary protected resources.

For more details, refer to the official Spider API documentation. Happy Crawling!
