# Crawling Authenticated Pages
Some pages require authentication before you can access their content. Spider supports two methods for handling login-protected pages:
- Cookies: Pass a session cookie directly in the request.
- Execution scripts: Run custom JavaScript to log in through a form.
## Setting a Cookie

If you already have a valid session cookie (from your browser dev tools or a login API), pass it with the `cookies` parameter:
```python
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com/dashboard",
        "return_format": "markdown",
        "cookies": "session_id=abc123; auth_token=xyz789",
        "request": "chrome",
    },
)

for page in response.json():
    print(page['url'], len(page.get('content', '')), 'chars')
```
To get the cookie value, log into the site in your browser, open dev tools (F12), go to the Application tab, and copy the relevant cookies.
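If the site exposes a login endpoint, you can also capture cookies programmatically instead of copying them from dev tools. A minimal sketch with `requests` — the login URL and form field names are placeholders for the target site's actual login flow:

```python
import requests

# Log in through the site's login endpoint; the URL and form field
# names here are placeholders for the target site's actual login flow.
session = requests.Session()
session.post('https://example.com/login',
             data={'username': 'your_username', 'password': 'your_password'})

# Flatten the session's cookie jar into the "name=value; name=value"
# string format that Spider's `cookies` parameter expects.
cookie_string = '; '.join(f'{c.name}={c.value}' for c in session.cookies)
print(cookie_string)
```

The resulting `cookie_string` can be passed directly as the `cookies` value in the crawl request above.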
## Using Execution Scripts

The `execution_scripts` parameter runs JavaScript on the page before Spider extracts content. This is useful for filling in login forms programmatically.
```python
import requests, os

headers = {
    'Authorization': f'Bearer {os.getenv("SPIDER_API_KEY")}',
    'Content-Type': 'application/json',
}

login_script = """
document.querySelector('#username').value = 'your_username';
document.querySelector('#password').value = 'your_password';
document.querySelector('form').submit();
"""

response = requests.post(
    'https://api.spider.cloud/crawl',
    headers=headers,
    json={
        "url": "https://example.com/login",
        "return_format": "markdown",
        "request": "chrome",
        "return_cookies": True,
        "execution_scripts": {
            "https://example.com/login": login_script
        },
    },
)

print(response.json())
```
The script runs after the page loads in headless Chrome. After the form submits, Spider follows the redirect and crawls the authenticated pages. Set `return_cookies` to `true` to capture the session cookies for subsequent requests.
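The captured cookies can then drive a plain cookie-based crawl, skipping the login script on later runs. A sketch of that two-step flow — note that where the cookie string appears in the response is an assumption here (a `cookies` field on each page object); check the API reference for the exact response shape:

```python
import requests, os

API_URL = 'https://api.spider.cloud/crawl'

def first_cookie_string(pages):
    # Return the first non-empty `cookies` string from a crawl response.
    # ASSUMPTION: page objects carry a `cookies` field when
    # `return_cookies` is true; verify against the API reference.
    for page in pages:
        if isinstance(page, dict) and page.get('cookies'):
            return page['cookies']
    return None

api_key = os.getenv('SPIDER_API_KEY')
if api_key:  # only hit the API when a key is configured
    headers = {'Authorization': f'Bearer {api_key}',
               'Content-Type': 'application/json'}

    login_script = """
    document.querySelector('#username').value = 'your_username';
    document.querySelector('#password').value = 'your_password';
    document.querySelector('form').submit();
    """

    # Step 1: log in via execution_scripts and capture the session.
    pages = requests.post(API_URL, headers=headers, json={
        "url": "https://example.com/login",
        "return_format": "markdown",
        "request": "chrome",
        "return_cookies": True,
        "execution_scripts": {"https://example.com/login": login_script},
    }).json()
    session_cookies = first_cookie_string(pages)

    # Step 2: reuse the cookies for a direct crawl of authenticated pages.
    if session_cookies:
        response = requests.post(API_URL, headers=headers, json={
            "url": "https://example.com/dashboard",
            "return_format": "markdown",
            "request": "chrome",
            "cookies": session_cookies,
        })
        print(response.json())
```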
## Tips

- Use Chrome mode: Authentication flows require JavaScript rendering. Always set `request: "chrome"`.
- Cookie format: Pass cookies as a semicolon-separated string, matching the format from browser dev tools.
- `wait_for`: If the login redirects to a page that loads dynamically, use `wait_for` to wait for a specific element before extraction.
- Scope with whitelist/blacklist: After login, control which paths Spider crawls using the `whitelist` and `blacklist` parameters.
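Putting the last two tips together, a request payload that waits for a post-login element and scopes the crawl to the account area might look like the sketch below. The `wait_for` shape, the selector, and the path patterns are illustrative assumptions, not confirmed values; check the API reference for the exact format:

```python
# Sketch of a scoped, authenticated crawl payload; pass it as the
# `json=` body of requests.post('https://api.spider.cloud/crawl', ...).
payload = {
    "url": "https://example.com/login",
    "return_format": "markdown",
    "request": "chrome",
    # Wait until an element that only exists after login appears
    # (the exact `wait_for` shape is an assumption -- see the API reference).
    "wait_for": {"selector": "#account-menu"},
    # Restrict the crawl to account pages and skip the logout link.
    "whitelist": ["/account/*"],
    "blacklist": ["/account/logout"],
}
print(payload)
```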
For all available parameters, see the API reference.