
Crawling Authenticated Pages with Spider


Prerequisites

To follow along, you need a Spider API key and a working Python installation.

Setting Up Your Environment

Install the Spider Python client:

pip install spider-client
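Once installed, you can create a Spider client with your API key. A minimal sketch, assuming the key is stored in a SPIDER_API_KEY environment variable (the variable name here is just a convention, not something the client requires):

import os

from spider import Spider

# Read the API key from the environment instead of hard-coding it.
api_key = os.getenv("SPIDER_API_KEY")
if not api_key:
    raise RuntimeError("Set the SPIDER_API_KEY environment variable first.")

app = Spider(api_key=api_key)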

Using the Spider API

One of the simplest ways to handle authenticated pages is by setting the cookie directly in your request.

Example

from spider import Spider

app = Spider(api_key='your_api_key')

# "cookies" takes a cookie string, e.g. a session cookie copied from your browser.
json_data = {"url": "http://www.example.com", "return_format": "markdown", "cookies": "session_id=your_session_cookie"}

response = app.crawl_url(json_data['url'], params=json_data)

print(response)
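The cookies value is a plain cookie string. If you need to send several cookies (for example values copied from your browser's dev tools), a small helper like the hypothetical one below can build a standard "name=value; name=value" string, assuming that is the format the API expects:

from spider import Spider

app = Spider(api_key='your_api_key')

# Hypothetical helper: join name/value pairs into a "name=value; name=value" cookie string.
def to_cookie_string(cookies):
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

session_cookies = {"session_id": "abc123", "csrftoken": "xyz789"}  # placeholder values

json_data = {"url": "http://www.example.com", "return_format": "markdown", "cookies": to_cookie_string(session_cookies)}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)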

Using execution_scripts to Run Custom JavaScript

The execution_scripts parameter allows you to run custom JavaScript code on the page. This can be particularly useful for actions like logging in through a form.

Example

First, let’s create a script that handles the login process:

document.querySelector('input[name="username"]').value = 'your_username';
document.querySelector('input[name="password"]').value = 'your_password';
document.querySelector('form').submit();

Now incorporate a script like this into the request. The execution_scripts value maps each URL to the JavaScript that should run on it:

from spider import Spider

app = Spider(api_key='your_api_key')

execution_scripts = {
    "http://www.example.com/login": """
    document.addEventListener("DOMContentLoaded", function() {
        document.querySelector("#username").value = "your_username";
        document.querySelector("#password").value = "your_password";
        document.querySelector("form").submit();
    });
    """
}

json_data = {
    "return_cookies": True,       # include the cookies set during login in the response
    "return_headers": True,       # include the response headers
    "request": "chrome",          # use a headless browser so the script can run
    "execution_scripts": execution_scripts,
    "return_format": "markdown",
    "url": "http://www.example.com/login"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
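Because return_cookies is enabled, the login response should also include the cookies the site set, which you can feed into a follow-up crawl of a protected page. A sketch of that idea; the exact shape of the response is assumed here (items treated as dicts with a cookies field), so inspect your own output before relying on it:

# The response shape is an assumption: adjust the field access to match your output.
items = response if isinstance(response, list) else [response]
session_cookies = next((item.get("cookies") for item in items if item.get("cookies")), None)

if session_cookies:
    follow_up = {
        "url": "http://www.example.com/account",  # hypothetical protected page
        "return_format": "markdown",
        "cookies": session_cookies,  # reuse the authenticated session
    }
    print(app.crawl_url(follow_up['url'], params=follow_up))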

Using GPT Config for AI-Driven Actions

You can use the gpt_config option to run AI-driven actions that interact with the page, such as filling in login forms or navigating through authentication flows, using models like GPT-4o. If your account does not have at least $50 in credits, you will need to provide your own OpenAI API key.

from spider import Spider

app = Spider(api_key='your_api_key')

gpt_config = {
    "prompt": ["login with the username 'henry@mail.com' and password `something`.", "extract the main article"],
    "model": "gpt-4o",
    "max_tokens": 2000,
    "temperature": 0.54,
    "top_p": 0.17,
    "api_key": None  # your_openai_api_key is only necessary if your credits are under $50
}

json_data = {
    "url": "http://www.example.com/authenticated-page",
    "gpt_config": gpt_config,
    "return_format": "markdown"
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)
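If you do need to supply your own OpenAI key, read it from the environment rather than hard-coding it before building the request. A small sketch, assuming the key lives in an OPENAI_API_KEY environment variable:

import os

# Only needed when your Spider credits are under $50 (see the note above).
gpt_config["api_key"] = os.getenv("OPENAI_API_KEY")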

GPT Configs Structure

The full gpt_config structure in JSON format, for reference:

{
	"prompt": ["login with the username 'henry@mail.com' and password `something`.", "extract the main article"],
	"model": "gpt-4o",
	"max_tokens": 2000,
	"temperature": 0.54,
	"user": null,
	"top_p": 0.17,
	"prompt_url_map": null,
	"extra_ai_data": true,
	"paths_map": true,
	"screenshot": false,
	"api_key": "your_openai_api_key",
	"cache": null
}

Additional Parameters

Spider provides additional parameters to control the behavior of your crawl. A few that appear in this guide include:

request - which backend to use ("chrome" in the examples above, so a headless browser can run scripts)
return_format - the format of the returned content (for example "markdown")
return_cookies and return_headers - include the response cookies and headers in the result
cookies - a cookie string sent with every request
execution_scripts - custom JavaScript to run on specific URLs
gpt_config - AI-driven actions to perform on the page

For a full list of parameters, refer to the Spider API Reference.
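As a quick recap, here is a sketch that combines several of the parameters used throughout this guide in a single request (the URL and cookie are placeholders):

from spider import Spider

app = Spider(api_key='your_api_key')

json_data = {
    "url": "http://www.example.com/authenticated-page",
    "request": "chrome",             # headless browser request
    "return_format": "markdown",     # return the page content as markdown
    "return_cookies": True,          # include response cookies in the result
    "return_headers": True,          # include response headers in the result
    "cookies": "session_id=your_session_cookie"  # send an existing session cookie
}

response = app.crawl_url(json_data['url'], params=json_data)
print(response)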

Conclusion

With the Spider API, you can handle authenticated pages by setting cookies directly, running custom JavaScript with execution_scripts, or driving the login with AI actions via gpt_config. These methods let your crawls reach the protected resources they need.
