
How to Bypass Cloudflare, DataDome, and PerimeterX in 2026

A technical breakdown of how modern anti-bot systems detect scrapers, why manual bypass is unsustainable, and how Spider handles it automatically.

Jeff Mendez · 15 min read


If you scrape the web at any meaningful scale, you have already run into at least one of these three systems. Cloudflare sits in front of roughly 20% of all websites. DataDome protects major e-commerce and travel platforms. PerimeterX (now operating under the HUMAN brand) guards high-value targets across finance, ticketing, and retail.

Each one uses a different detection strategy. Each one updates its models on a weekly or biweekly cycle. And each one is specifically designed to make your life harder as a developer trying to collect public data programmatically.

This post breaks down how these systems actually work at the protocol and behavioral level, covers the common bypass techniques developers reach for, explains why maintaining those bypasses yourself is a losing investment, and shows how a managed approach eliminates the problem entirely.

How Cloudflare Detects Bots

Cloudflare operates at the network edge. Every HTTP request to a Cloudflare-protected site passes through their reverse proxy before it reaches the origin server. Detection happens in layers.

Managed Rules and Bot Score

Every request receives a bot score between 1 and 99. A score of 1 means Cloudflare is almost certain the request is automated. A score of 99 means it looks fully human. This score is computed from a combination of signals:

  • IP reputation: Cloudflare maintains a global threat intelligence feed. Datacenter IP ranges, known proxy providers, and IPs with a history of abusive traffic get flagged immediately.
  • TLS fingerprint (JA3/JA4): The TLS ClientHello message contains a unique combination of cipher suites, extensions, and supported curves. Cloudflare hashes this into a fingerprint and compares it against known browser profiles. If your HTTP client’s TLS handshake looks like python-requests/2.31 instead of Chrome 124, the request is scored accordingly.
  • HTTP/2 fingerprint: Beyond TLS, Cloudflare inspects HTTP/2 SETTINGS frames, WINDOW_UPDATE parameters, header order, and pseudo-header ordering. Real browsers produce consistent, predictable patterns. Most HTTP libraries do not.
  • Header analysis: Missing headers, unusual header order, or headers that contradict the claimed User-Agent all contribute to a lower bot score.

Managed rules act on this score. Site operators configure thresholds: block requests below 30, challenge requests below 50, allow everything above. The defaults are aggressive enough to catch naive automation.
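
To make the TLS fingerprint signal concrete, here is a minimal sketch of how a JA3 hash is assembled from ClientHello fields: the decimal values within each field are joined with dashes, the fields with commas, and the result is MD5-hashed. The field values below are illustrative, not taken from a real capture.

import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    # JA3: comma-separated fields, dash-separated values, MD5 of the resulting string.
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Illustrative values only; a real fingerprint is parsed from the raw ClientHello.
print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))

Every input to that hash comes from the TLS stack, not from the application, which is why pasting a Chrome User-Agent header onto python-requests does nothing to change the fingerprint Cloudflare sees.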

Turnstile Challenges

When Cloudflare decides a request is suspicious but not definitively automated, it serves a Turnstile challenge. Turnstile replaced the older hCaptcha integration in 2023 and works differently from traditional CAPTCHAs.

Turnstile runs a JavaScript challenge in the browser that collects environmental signals: canvas fingerprint, WebGL renderer, installed fonts, screen dimensions, timezone, language settings, and dozens of other browser API outputs. It also measures execution timing. A real browser running on real hardware produces consistent timing characteristics. A headless browser running inside a Docker container on a server does not.

The critical detail: Turnstile challenges are invisible by default. The user never sees a checkbox or image grid. The JavaScript executes silently, collects its signals, and either passes or fails. This means you cannot “solve” a Turnstile challenge the way you would solve a reCAPTCHA. There is no image to classify. The challenge is the browser environment itself.

Under Attack Mode

Site operators can enable “I’m Under Attack” mode, which forces a 5-second JavaScript challenge on every single request. This is the nuclear option. It catches virtually all non-browser traffic, but it also adds latency for legitimate visitors. Sites under active DDoS attacks use this temporarily. Some sites leave it on permanently for high-value endpoints like login pages and checkout flows.

How DataDome Detects Bots

DataDome takes a fundamentally different approach from Cloudflare. Instead of operating at the network edge, DataDome injects client-side JavaScript that performs deep behavioral analysis.

Device Fingerprinting

DataDome’s JavaScript collector builds a comprehensive device fingerprint that goes well beyond standard browser fingerprinting:

  • Canvas and WebGL fingerprinting: Rendering a specific image through the Canvas API and reading back the pixel data. Different GPU drivers, font rendering engines, and anti-aliasing implementations produce different outputs. WebGL extends this by querying the GPU renderer string, supported extensions, and shader precision formats.
  • AudioContext fingerprinting: Processing a short audio signal through the Web Audio API and measuring the output. Different audio hardware and driver stacks produce measurably different results.
  • Hardware concurrency and memory: navigator.hardwareConcurrency and navigator.deviceMemory reveal the machine’s CPU core count and a coarse bucket of its RAM (Chrome caps the reported value at 8GB). A headless Chrome instance claiming to be a mid-range phone while reporting 32 CPU cores is an obvious contradiction.
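
As a rough illustration of the raw material involved, the sketch below uses Playwright to read a few of the same properties a collector script inspects. It is a simplified stand-in for DataDome’s collector, not its actual code, and the target URL is a placeholder.

from playwright.sync_api import sync_playwright

COLLECTOR_JS = """
() => {
  // A tiny subset of what a real collector gathers.
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');
  ctx.textBaseline = 'top';
  ctx.font = '14px Arial';
  ctx.fillText('fingerprint-test', 2, 2);
  const gl = document.createElement('canvas').getContext('webgl');
  const dbg = gl && gl.getExtension('WEBGL_debug_renderer_info');
  return {
    // Tail of the rendered pixel data; a real collector hashes the full output.
    canvasSample: canvas.toDataURL().slice(-32),
    webglRenderer: dbg ? gl.getParameter(dbg.UNMASKED_RENDERER_WEBGL) : null,
    cores: navigator.hardwareConcurrency,   // e.g. 32 on a server
    deviceMemory: navigator.deviceMemory,   // bucketed, capped at 8 in Chrome
    userAgent: navigator.userAgent,
  };
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    signals = page.evaluate(COLLECTOR_JS)
    browser.close()

# A mobile User-Agent paired with 32 cores and a desktop GPU string is exactly
# the kind of contradiction the backend models flag.
print(signals)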

Behavioral Analysis

DataDome’s real strength is behavioral modeling. The JavaScript collector tracks:

  • Mouse movement patterns: Velocity, acceleration, jitter, curvature between points. Human mouse movements follow predictable biomechanical patterns. Synthetic movements (linear interpolation, bezier curves with uniform speed) are statistically distinguishable.
  • Scroll behavior: Scroll speed, direction changes, momentum. Humans scroll with inertia and variable speed. Automated scrolling tends to be uniform.
  • Keystroke dynamics: Inter-key timing, key hold duration, typing rhythm. These patterns are unique enough to serve as biometric identifiers.
  • Touch events on mobile: Pressure, contact area, gesture patterns. DataDome uses these to distinguish real mobile devices from desktop browsers spoofing a mobile User-Agent.
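
A stripped-down sketch of the statistical idea, with no claim about DataDome’s actual feature set: given a list of (x, y, t) mouse samples, the variance of per-step speed alone separates a constant-speed interpolated sweep from a jittery, human-like trace.

import math
import random

def speed_stats(trace):
    # trace: list of (x, y, t) samples; returns (mean speed, coefficient of variation).
    speeds = []
    for (x0, y0, t0), (x1, y1, t1) in zip(trace, trace[1:]):
        dt = t1 - t0
        if dt > 0:
            speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    mean = sum(speeds) / len(speeds)
    variance = sum((s - mean) ** 2 for s in speeds) / len(speeds)
    return mean, math.sqrt(variance) / mean

# Synthetic trace: linear interpolation at perfectly uniform speed.
bot = [(i * 5, i * 3, i * 0.01) for i in range(100)]

# Crude human-like trace: same path with positional jitter and uneven timing.
human = [(i * 5 + random.gauss(0, 2), i * 3 + random.gauss(0, 2), i * 0.01 + random.uniform(0, 0.004))
         for i in range(100)]

print("bot   speed CV:", round(speed_stats(bot)[1], 3))    # ~0.0: uniform speed
print("human speed CV:", round(speed_stats(human)[1], 3))  # clearly nonzero

Production models use far richer features (acceleration, curvature, pauses, timing entropy), but the gap between synthetic and organic input shows up even in a two-line statistic.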

Real-Time ML Detection

All of this telemetry streams to DataDome’s backend in real time. Their ML models classify sessions within milliseconds. The models are retrained continuously on fresh data, which means any bypass technique that works today has a limited shelf life. DataDome publishes detection rate improvements quarterly, and they consistently claim sub-second detection of new bot patterns.

The practical implication: even if you successfully replicate a browser environment well enough to pass the initial fingerprint check, DataDome will flag your session based on behavioral anomalies within the first few page interactions.

How PerimeterX (HUMAN) Detects Bots

PerimeterX merged into HUMAN Security in 2022, and its bot detection now ships as part of HUMAN’s bot defense platform (HUMAN Bot Defender). Their detection methodology combines elements of both Cloudflare’s network-level analysis and DataDome’s behavioral approach, with some unique additions.

Sensor Data Collection

PerimeterX deploys a JavaScript sensor (typically loaded from a /px/ or /_sec/ path) that collects an extensive set of environmental and behavioral signals. The sensor data is encrypted and sent to PerimeterX’s backend as an opaque payload.

Key signals include:

  • Browser API consistency checks: PerimeterX tests whether browser APIs behave as expected. For example, it might check whether navigator.webdriver is present (indicating automation), whether window.chrome exists with the expected properties, or whether Notification.permission returns a plausible value. Headless browsers and automation frameworks often fail these consistency checks because they either expose automation flags or implement browser APIs incompletely.
  • JavaScript execution environment: The sensor checks for signs of instrumentation. Overridden prototypes, modified getter/setter behavior on native objects, non-standard toString() outputs on built-in functions. Tools like Puppeteer and Playwright modify the JavaScript environment in detectable ways, even with stealth plugins applied.
  • DOM structure analysis: PerimeterX inspects the DOM for artifacts left by automation frameworks. Puppeteer injects specific elements. Selenium leaves webdriver attributes. Even sophisticated setups sometimes leave traces in the DOM or in the browser’s internal state.
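
A rough sketch of what those consistency checks look like, run through Playwright here purely for convenience; a real sensor performs many more of them inline in the page’s own JavaScript and ships the results as an encrypted payload.

from playwright.sync_api import sync_playwright

CONSISTENCY_JS = """
() => ({
  // Present and true under most automation frameworks unless patched.
  webdriver: navigator.webdriver,
  // Real Chrome defines window.chrome; a Chrome User-Agent without it is a contradiction.
  hasWindowChrome: typeof window.chrome !== 'undefined',
  // Should be 'default', 'granted', or 'denied'; broken environments return odd values.
  notificationPermission: typeof Notification !== 'undefined' ? Notification.permission : null,
})
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.evaluate(CONSISTENCY_JS))  # stock automated headless Chrome fails the first check
    browser.close()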

Behavioral Biometrics

Like DataDome, PerimeterX tracks mouse movements, keystrokes, and touch events. Their behavioral models focus specifically on detecting:

  • Replay attacks: If two sessions produce identical behavioral patterns, one of them is synthetic. PerimeterX maintains session-level behavioral profiles and flags statistical duplicates.
  • Timing anomalies: The time between page load and first interaction, between clicks, between page navigations. Bots tend to operate at machine speed, interacting with pages faster than any human could.

Proof-of-Work Challenges

When PerimeterX is uncertain about a session, it can issue a proof-of-work challenge. The browser must solve a computationally expensive problem (typically involving hash computation) before the request is allowed through. This serves two purposes: it adds latency that makes large-scale scraping more expensive, and it tests whether the client has the computational resources expected of a real browser on real hardware.

The difficulty of the proof-of-work challenge scales with suspicion level. A mildly suspicious session might get a challenge that takes 100ms to solve. A highly suspicious session might get one that takes several seconds.
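
The mechanism is easy to sketch in generic form, with no claim that this matches PerimeterX’s actual challenge format: the server hands the client a challenge string and a difficulty, and the client must find a nonce whose hash clears that difficulty. Each additional bit of difficulty doubles the expected work, which is how the 100ms-to-several-seconds scaling is achieved.

import hashlib
import time

def solve_pow(challenge: str, difficulty_bits: int) -> int:
    # Find a nonce such that sha256(challenge + nonce) has `difficulty_bits` leading zero bits.
    target = 1 << (256 - difficulty_bits)  # any digest below this value qualifies
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

for bits in (8, 16, 20):
    start = time.perf_counter()
    solve_pow("example-challenge", bits)
    print(f"{bits} leading zero bits: {time.perf_counter() - start:.3f}s")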

Common Bypass Techniques (and Their Limitations)

Developers who scrape at scale have built an extensive toolkit of bypass techniques. Here are the most common approaches and why each one eventually fails.

Header Rotation and TLS Fingerprint Matching

The first thing most developers try is rotating User-Agent strings and matching request headers to look like a real browser. More sophisticated implementations go further, using libraries like curl-impersonate or tls-client to replicate the exact TLS fingerprint of a specific browser version.
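
For example, curl_cffi (Python bindings for curl-impersonate) lets you borrow a real browser’s TLS and HTTP/2 fingerprint with a single argument. The impersonation target name below follows curl_cffi’s convention; check the library’s documentation for the exact targets your installed version supports.

# pip install curl_cffi
from curl_cffi import requests

# Plain python-requests would present an OpenSSL/urllib3 TLS fingerprint here.
# impersonate= makes the ClientHello and HTTP/2 frames match a recent Chrome build.
response = requests.get(
    "https://protected-site.example.com",
    impersonate="chrome",  # or a pinned version string such as "chrome124"
)
print(response.status_code)

This genuinely fixes the JA3/JA4 mismatch for the impersonated browser version, which is why it is a popular first step.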

Why it breaks: Cloudflare and PerimeterX now fingerprint beyond TLS. HTTP/2 frame parameters, header ordering heuristics, and connection behavior patterns all contribute to detection. Matching the TLS fingerprint of Chrome 124 is necessary but not sufficient. You also need to match Chrome 124’s HTTP/2 SETTINGS frame, its HPACK behavior, its pseudo-header ordering, and a dozen other protocol-level details that change with every browser release.

Browser Automation with Stealth Plugins

Puppeteer Extra with the stealth plugin, Playwright with custom launch arguments, undetected-chromedriver. These tools patch known detection vectors: removing navigator.webdriver, spoofing window.chrome, fixing the Permissions API, and normalizing WebGL output.

Why it breaks: Anti-bot vendors actively reverse-engineer these stealth plugins and add detection for the specific patches they apply. The stealth plugin’s patches become signatures themselves. When puppeteer-extra-plugin-stealth patches navigator.webdriver, it does so in a way that is subtly different from a browser where the flag was never present. PerimeterX’s sensor detects these differences.

The cat-and-mouse cycle is fast. A new stealth patch might work for a few weeks before the anti-bot vendor detects the specific modification pattern and flags it. You then need a new patch, which will also be detected eventually.
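
As a concrete instance of patches becoming signatures, consider navigator.webdriver. In a genuine browser the property lives as a native getter on Navigator.prototype; a naive patch that redefines it on the navigator instance, or replaces the getter with plain JavaScript, is itself detectable. The sketch below (again run through Playwright for convenience) is a simplified version of that class of check, not PerimeterX’s actual sensor logic.

from playwright.sync_api import sync_playwright

PATCH_DETECTOR_JS = """
() => {
  const proto = Object.getPrototypeOf(navigator);
  const protoDesc = Object.getOwnPropertyDescriptor(proto, 'webdriver');
  return {
    // Genuine Chrome: getter defined on Navigator.prototype.
    definedOnPrototype: protoDesc !== undefined,
    // A naive Object.defineProperty(navigator, 'webdriver', ...) patch leaves an
    // own property where a clean browser has none.
    definedOnInstance: Object.getOwnPropertyDescriptor(navigator, 'webdriver') !== undefined,
    // A JavaScript replacement getter usually fails the native-code stringify test.
    getterLooksNative: protoDesc && protoDesc.get
      ? /\\[native code\\]/.test(Function.prototype.toString.call(protoDesc.get))
      : null,
  };
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.evaluate(PATCH_DETECTOR_JS))
    browser.close()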

Residential Proxies

Residential proxies route traffic through real consumer IP addresses. Since anti-bot systems give higher trust scores to residential IPs than datacenter IPs, this can bypass IP reputation checks.

Why it breaks: Residential proxies solve only one dimension of the detection problem. If your TLS fingerprint, behavioral patterns, or browser environment are flagged, a clean IP address will not save you. Additionally, residential proxy providers recycle IPs across customers. If another customer burns an IP with aggressive scraping, your next request through that IP inherits the damage.

Cost is the other issue. Residential proxy bandwidth runs $5 to $15 per gigabyte depending on the provider and geography. At scale, proxy costs often exceed the cost of the actual scraping infrastructure.

CAPTCHA Solving Services

Services like 2Captcha, Anti-Captcha, and CapSolver use human workers or ML models to solve visual CAPTCHAs. You forward the challenge, they return the solution token, you submit it.

Why it breaks: Turnstile challenges (Cloudflare) and PerimeterX proof-of-work challenges are not traditional image CAPTCHAs. There is no image to solve. The challenge is the browser environment itself. CAPTCHA solving services cannot replicate a browser environment on your behalf.

For DataDome’s visual challenges, solving services can work, but the latency of forwarding the challenge, solving it, and returning the token (typically 10 to 30 seconds) is often enough for DataDome to flag the session based on timing anomalies. A real human solves a simple CAPTCHA in 2 to 5 seconds. A session that takes 25 seconds looks automated.

Why Manual Bypass is a Losing Game

Even if you build a bypass that works today against all three systems, here is what maintaining it looks like:

Update frequency: Cloudflare ships detection updates weekly. DataDome retrains models continuously. PerimeterX updates sensor code on a biweekly cycle. Your bypass has a half-life measured in days, not months.

Combinatorial complexity: Different sites configure different challenge thresholds, different fingerprinting strictness, and different fallback behaviors. A bypass that works against Site A’s Cloudflare configuration might fail on Site B’s, even though both use Cloudflare. You end up maintaining per-site configurations.

Infrastructure cost: Running headless browsers at scale requires significant compute. Add residential proxies, CAPTCHA solving services, and the engineering time to maintain bypass logic, and the total cost per page often exceeds what a managed service charges.

Risk concentration: One mistake (a leaked fingerprint pattern, a proxy provider that gets flagged, a stealth plugin update that introduces a new detection vector) can burn your entire IP pool and browser profile set. Recovery means rebuilding from scratch.

The trade-off depends on scale and diversity. If you are scraping one or two well-understood sites, maintaining custom bypass logic is manageable and gives you full control. If you are scraping across hundreds of domains with varying protection, per-site maintenance becomes untenable.

The Managed Approach: Let Spider Handle It

Spider was built from the ground up to handle anti-bot bypass as an infrastructure problem, not an application-level concern. When you send a request to Spider, the platform automatically detects what protections the target site uses and escalates its approach accordingly.

How Spider’s Smart Mode Works

Spider’s default smart mode inspects each target URL and selects the cheapest, fastest path that will succeed:

  1. Static fetch: If the page serves content without JavaScript rendering or anti-bot challenges, Spider uses a lightweight HTTP fetch. This is the fastest and cheapest path.
  2. Chrome rendering: If the page requires JavaScript execution, Spider routes through headless Chrome with full fingerprint management.
  3. Anti-bot escalation: If the page is protected by Cloudflare, DataDome, PerimeterX, Akamai, Imperva, or Distil, Spider automatically escalates. This includes browser fingerprint rotation, proxy tier selection (datacenter, residential, or mobile), challenge handling, and automatic retries with different configurations on failure.

You do not need to specify which anti-bot system a site uses. You do not need to configure proxy tiers or browser profiles. Spider detects and adapts.

Built-In Capabilities

  • Proxy rotation across tiers: Datacenter, residential, and mobile proxies with automatic failover. Spider selects the appropriate tier based on the target site’s protection level.
  • Browser fingerprint management: Each session uses a consistent, realistic browser fingerprint. Fingerprints rotate across sessions to prevent cross-session correlation.
  • TLS and HTTP/2 fingerprint matching: Spider’s browser instances produce TLS and HTTP/2 fingerprints that match real consumer browsers, down to the SETTINGS frame parameters and header ordering.
  • Automatic retry with escalation: When a request fails, Spider does not simply retry with the same configuration. It escalates: trying a different proxy tier, a different browser profile, or a different challenge-handling strategy.
  • CAPTCHA handling: Spider handles common CAPTCHA challenges natively, without relying on third-party solving services.

Code Comparison: DIY vs. Spider

Here is what it looks like to scrape a Cloudflare-protected page yourself using Playwright with stealth settings, proxy rotation, and retry logic:

import random
from playwright.sync_api import sync_playwright

PROXIES = [
    "http://user:pass@residential1.example.com:8080",
    "http://user:pass@residential2.example.com:8080",
    "http://user:pass@residential3.example.com:8080",
]

USER_AGENTS = [
    # Example desktop Chrome UA strings; keep these in sync with current releases,
    # or the claimed browser will contradict the TLS fingerprint.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def scrape_with_bypass(url, max_retries=3):
    with sync_playwright() as p:
        for attempt in range(max_retries):
            browser = p.chromium.launch(
                proxy={"server": random.choice(PROXIES)},
                args=["--disable-blink-features=AutomationControlled"],
            )
            context = browser.new_context(
                user_agent=random.choice(USER_AGENTS),
                viewport={"width": 1920, "height": 1080},
            )
            page = context.new_page()
            try:
                page.goto(url, wait_until="networkidle")
                if "challenge" in page.content().lower():
                    print(f"Attempt {attempt + 1}: challenge detected")
                    continue
                return page.content()
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
            finally:
                browser.close()
    raise Exception("All bypass attempts failed")

html = scrape_with_bypass("https://protected-site.example.com")

This is a reasonable starting point, and Playwright is a sensible foundation for a DIY bypass in 2026. But it still does not handle TLS fingerprinting, HTTP/2 fingerprinting, behavioral emulation, or CAPTCHA solving. It will work against basic Cloudflare managed rules but fail against a well-configured DataDome or PerimeterX deployment.

Here is the same task with Spider:

import requests, os

response = requests.post(
    "https://api.spider.cloud/crawl",
    headers={
        "Authorization": f"Bearer {os.getenv('SPIDER_API_KEY')}",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://protected-site.example.com",
        "return_format": "markdown",
    },
)

for page in response.json():
    print(page["content"])

About a dozen lines. No proxy configuration. No stealth plugins. No retry logic. No fingerprint management. Spider detects the protection, selects the right approach, handles challenges, retries on failure, and returns clean content.

Comparison: DIY Bypass vs. Spider

| Dimension | DIY Bypass | Spider |
| --- | --- | --- |
| Cloudflare Turnstile | Requires stealth plugins, TLS fingerprint matching, residential proxies. Breaks as detection updates ship. | Handled automatically. Smart mode detects and escalates. |
| DataDome | Requires behavioral emulation, device fingerprint spoofing. Most open source tools cannot pass. | Built-in behavioral and fingerprint management. |
| PerimeterX / HUMAN | Requires sensor data collection, proof-of-work solver, browser API consistency. | Managed server-side. Challenge handling runs internally. |
| Debugging visibility | Full: you see every request, header, and response. You can trace exactly why a request was blocked. | Opaque: you get an error code but not the detection reason. If a page fails, you file a support request. |
| Custom challenge handling | Can implement site-specific logic for edge cases and unusual challenge flows. | Generic handling. Non-standard challenge flows may require waiting for platform updates. |
| Data residency | Traffic routes through your own infrastructure. Full control over exit IPs and intermediaries. | Traffic routes through Spider's proxies. You do not control the exit IP geography. |
| Proxy costs | $5-15/GB for residential. $50-200/month for a usable pool. | Included in per-page pricing. |
| Maintenance | Ongoing updates to match detection changes. Per-site tuning. | Maintained by Spider's team. You call the API. |
| Cost per 1,000 pages | $5-50+ depending on proxy tier and compute. | Starts at $0.65, scales with site complexity. |
| Success rate | 60-95% on well-protected sites, depending on tuning. | ~99% across production traffic (varies by protection level). |

When You Should Build It Yourself

There are real advantages to maintaining your own bypass infrastructure:

  • Single-site, deep expertise. If you scrape one or two sites you deeply understand, custom bypass gives you full control and full visibility. You can tune behavior per-page, debug failures by inspecting every request, and react to detection changes on your own timeline rather than waiting for a vendor update.
  • Security research and auditing. Understanding the detection mechanism is the goal, not bypassing it. A managed service hides the details you need to study.
  • Data residency and compliance. Your traffic routes through your own infrastructure. No third-party proxy network touches the data.
  • Cost optimization at single-site scale. If you are scraping one well-understood site at high volume, a tuned DIY setup with curl-impersonate or Playwright can be cheaper than any managed API because you only pay for compute and proxies — no per-page fee.

Either way, anti-bot bypass is infrastructure, not application logic. Whether you build it yourself or use a managed service, your application code should not contain if cloudflare: do_x(); elif datadome: do_y(). That logic belongs in a layer below your application, maintained by a team, in-house or a vendor, whose focus is keeping up with detection changes.

The real question is whether anti-bot maintenance is a good use of your engineering team’s time. For most teams scraping across many domains, it is not. For teams deeply focused on one or two high-value targets, it might be.
