
Crawling Authenticated Pages

Two methods for crawling pages behind login walls: cookies and execution scripts.

Jeff Mendez · 2 min read


Some pages require authentication before you can access their content. Spider supports two methods for handling login-protected pages:

  1. Cookies: Pass a session cookie directly in the request.
  2. Execution scripts: Run custom JavaScript to log in through a form.

Using Cookies

If you already have a valid session cookie (from your browser's dev tools or a login API), pass it with the cookies parameter:

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post('https://api.spider.cloud/crawl',
  headers=headers,
  json={
    "url": "https://example.com/dashboard",
    "return_format": "markdown",
    "cookies": "session_id=abc123; auth_token=xyz789",
    "request": "chrome"
  }
)

for page in response.json():
    print(page['url'], len(page.get('content', '')), 'chars')

To get the cookie value, log into the site in your browser, open dev tools (F12), go to the Application tab (Storage in Firefox), expand Cookies, and copy the relevant name=value pairs.
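If you copied cookies as individual name/value pairs rather than one string, a small helper can join them into the semicolon-separated format shown above (a minimal sketch; the helper name is ours, not part of the API):

```python
def cookies_to_header(cookies: dict) -> str:
    """Join name/value pairs into the 'a=1; b=2' format used by the cookies parameter."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

# Build the string to pass as the "cookies" field of the request payload.
cookie_str = cookies_to_header({"session_id": "abc123", "auth_token": "xyz789"})
# → "session_id=abc123; auth_token=xyz789"
```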

Using Execution Scripts

The execution_scripts parameter runs JavaScript on the page before Spider extracts content. This is useful for filling in login forms programmatically.

import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

login_script = """
document.querySelector('#username').value = 'your_username';
document.querySelector('#password').value = 'your_password';
document.querySelector('form').submit();
"""

response = requests.post('https://api.spider.cloud/crawl',
  headers=headers,
  json={
    "url": "https://example.com/login",
    "return_format": "markdown",
    "request": "chrome",
    "return_cookies": True,
    "execution_scripts": {
      "https://example.com/login": login_script
    }
  }
)

print(response.json())

The script runs after the page loads in headless Chrome. After the form submits, Spider follows the redirect and crawls the authenticated pages. Set return_cookies: true to capture the session cookies for subsequent requests.
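Once you have the captured cookies, you can feed them into a follow-up crawl that skips the login form. The exact shape of the cookie records in the response is an assumption here (check your actual response for the field names); the sketch below just shows how to fold them back into a payload:

```python
def build_followup_payload(url: str, cookie_records: list) -> dict:
    """Build a crawl payload that reuses captured session cookies.

    cookie_records uses a hypothetical [{"name": ..., "value": ...}] shape;
    inspect the return_cookies output for the real structure.
    """
    cookie_str = "; ".join(f"{c['name']}={c['value']}" for c in cookie_records)
    return {
        "url": url,
        "return_format": "markdown",
        "request": "chrome",
        "cookies": cookie_str,
    }
```

Pass the resulting dict as the json body of a second requests.post to the crawl endpoint, the same way as in the cookie example above.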

Tips

  • Use Chrome mode: Login forms and post-login redirects depend on JavaScript rendering, so set request: "chrome" for authentication flows.
  • Cookie format: Pass cookies as a semicolon-separated string, matching the format from browser dev tools.
  • wait_for: If the login redirects to a page that loads dynamically, use wait_for to wait for a specific element before extraction.
  • Scope with whitelist/blacklist: After login, control which paths Spider crawls using the whitelist and blacklist parameters.
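Putting the tips together, a single payload might look like the sketch below. The value shapes for wait_for, whitelist, and blacklist are assumptions for illustration; confirm them against the API reference before relying on them:

```python
# Hypothetical combined payload: Chrome rendering, a wait_for selector,
# and whitelist/blacklist scoping after login. Field value shapes are
# illustrative, not confirmed API formats.
payload = {
    "url": "https://example.com/login",
    "request": "chrome",                 # JavaScript rendering for the auth flow
    "return_format": "markdown",
    "wait_for": {"selector": "#dashboard"},  # wait for a post-login element (assumed shape)
    "whitelist": ["/account/*"],         # only crawl account pages after login
    "blacklist": ["/logout"],            # never hit the logout endpoint
}
```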

For all available parameters, see the API reference.
