Transform API POST /transform

HTML and PDF in. Markdown out.

Send bytes you already have. Spider returns clean markdown, plain text, or sanitized HTML without re-fetching the page from the web.

Input HTML
<div class="nav-wrap">
<ul id="menu">...</ul>
</div>
<article>
<h1>Hello World</h1>
<p>Content here</p>
</article>
<footer>...</footer>

Output Markdown
# Hello World

Content here

nav, footer, classes stripped

0.1 credits / page

No browser rendering, no proxy overhead. Pure content conversion at the lowest cost per page.

Transform vs Scrape

Why pick Transform instead?

Already have the content

Sitting on HTML from your own crawlers, or PDFs from an S3 bucket? Transform converts them without paying for another network request or spinning up a browser.

Cost efficient

Starting at 0.1 credits per HTML page and 10 credits per PDF page, Transform is the lowest-cost endpoint on the platform, with zero browser or proxy overhead.
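For a rough sense of scale, the arithmetic can be sketched as below. The helper name `estimate_min_credits` is hypothetical, and because the quoted rates are "starting at" figures, the result is best read as a lower bound rather than a quote.

```python
from decimal import Decimal

# Rates quoted on this page. Treated here as minimums, not guaranteed prices.
HTML_CREDITS_PER_PAGE = Decimal("0.1")
PDF_CREDITS_PER_PAGE = Decimal("10")


def estimate_min_credits(html_pages: int, pdf_pages: int) -> Decimal:
    """Return a lower-bound credit estimate for a mixed HTML/PDF job."""
    if html_pages < 0 or pdf_pages < 0:
        raise ValueError("page counts must be non-negative")
    return html_pages * HTML_CREDITS_PER_PAGE + pdf_pages * PDF_CREDITS_PER_PAGE


if __name__ == "__main__":
    # 1,000 saved HTML pages plus a 20-page PDF.
    print(estimate_min_credits(html_pages=1000, pdf_pages=20))
```

Decimal arithmetic is used so that fractional credit rates do not pick up floating-point rounding noise.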

Batch processing

Send an array of HTML documents in one request. Process entire collections of saved pages in a single API call, up to 10 MB total.

Cleaning levels

Three levels, at most one parameter each.
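The three levels in this section differ only in which flag is set, which a small lookup table can make explicit. This is a minimal sketch: the flag names `clean` and `clean_full` come from this page, while the level keys and the `build_params` helper are illustrative names, not API values.

```python
from typing import Dict

# Flag names are taken from this page; the dictionary keys are
# illustrative labels for the three levels, not values the API accepts.
CLEANING_FLAGS: Dict[str, Dict[str, bool]] = {
    "standard": {},                      # no flags needed
    "ai_clean": {"clean": True},         # recommended for LLM input
    "full_clean": {"clean_full": True},  # minimal, semantic markup
}


def build_params(level: str, return_format: str = "markdown") -> Dict[str, object]:
    """Build a params dict for one cleaning level (hypothetical helper)."""
    if level not in CLEANING_FLAGS:
        raise ValueError(f"unknown cleaning level: {level!r}")
    params: Dict[str, object] = {"return_format": return_format}
    params.update(CLEANING_FLAGS[level])
    return params
```

The resulting dict has the same shape as the `params` argument in the Python example under Examples.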

Light

Standard

Basic HTML-to-format conversion. Preserves all content structure including navigation, footers, and sidebars.

No flags needed

Recommended

AI clean

Removes navigation, footers, ads, and boilerplate. Keeps main article content, optimized for feeding into language models.

"clean": true

Deep

Full clean

Strips all non-essential HTML attributes: classes, IDs, inline styles. Produces minimal, semantic markup.

"clean_full": true

Capabilities

What the converter handles.

Readability extraction

Enable readability to extract just the main content using Mozilla's readability algorithm. Perfect for articles and blog posts.

Multiple output formats

Convert to markdown, text, or sanitized HTML. Markdown for LLMs, text for NLP, clean HTML for re-rendering.

URL context

Pass the source URL alongside HTML so relative links resolve to absolute URLs. Ensures links in markdown output work correctly.
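To see why the source URL matters, here is what standard relative-link resolution does with and without a base. The sketch uses Python's `urllib.parse.urljoin` to illustrate the general rule. It is not a description of how Transform implements it internally, and `resolve_link` is a hypothetical helper.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found in a page against that page's URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    base = "https://example.com/blog/post"
    for href in ["/pricing", "../about", "img/chart.png", "https://other.example/x"]:
        print(href, "->", resolve_link(base, href))
```

Without the page URL, a link such as `/pricing` has no host to attach to, so it would stay relative in the markdown output.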

Batch input

Send an array of {html, url} objects. Transform dozens of pages in a single request to minimize round-trips.

PDF to markdown

Send PDF bytes, get structured markdown. Tables, headings, lists, and reading order are preserved. Handles scanned documents with built-in OCR.

10 MB payload

Process up to 10 MB of HTML per request. Large pages, long articles, and complex documents are handled without truncation.
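When a saved collection is larger than one request allows, it can be split client-side before sending. The sketch below groups `{html, url}` objects into batches under a byte budget. It assumes the limit applies to the UTF-8 size of the HTML and takes 10 MB conservatively as 10,000,000 bytes. Neither detail is stated on this page, and `chunk_documents` is a hypothetical helper.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption: 10 MB is read as 10,000,000 bytes of UTF-8 encoded HTML.
MAX_BATCH_BYTES = 10_000_000


def chunk_documents(
    docs: Iterable[Dict[str, str]],
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of {html, url} objects whose combined HTML fits in max_bytes."""
    batch: List[Dict[str, str]] = []
    batch_bytes = 0
    for doc in docs:
        size = len(doc["html"].encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"document {doc.get('url')!r} exceeds the limit on its own")
        # Start a new batch when adding this document would overflow the budget.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded list can be passed as the first argument to `client.transform`, one request per batch.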

Examples

cURL, Python, Node.

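from spider import Spider

# Assumes your API key is already configured for the client
# (for example through an environment variable).
client = Spider()

html_content = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# Each input item pairs raw HTML with its source URL so relative links can resolve.
result = client.transform(
    [{"html": html_content, "url": "https://example.com"}],
    params={
        "return_format": "markdown",  # output format
        "clean": True,  # AI clean: drop navigation, footers, ads, boilerplate
    },
)

print(result[0]["content"])
# Output: # Hello\n\nWorld

Use cases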

Where teams reach for it.

01

Post-processing cached content

You've already saved HTML from your own crawlers or a CDN cache. Transform converts it to clean markdown without consuming browser or proxy credits.

02

Email & newsletter parsing

Convert HTML emails into readable text or markdown for indexing, summarization, or feeding into language models.

03

CMS content migration

Export HTML from one CMS and transform it to markdown for import into a static site generator, wiki, or headless CMS.

04

PDF ingestion for RAG pipelines

Pull structured text from research papers, 10-Ks, contracts, and technical specs. Feed clean markdown directly into your vector store or LLM context window.

Get started

Ready to transform?

One endpoint for HTML and PDF. Clean markdown out, every time.