Transform API
Drop in raw HTML or PDF bytes. Get back structured markdown, clean text, or sanitized HTML. No browser, no proxy, no re-fetching. Just the conversion.
Why Use Transform Instead of Scrape?
Already Have the Content
Sitting on HTML from your own crawlers, or PDFs from an S3 bucket? Transform converts them without paying for another network request or spinning up a browser.
Cost Efficient
Starting at 0.1 credits for HTML and 10 credits per PDF page, Transform is the lowest-cost endpoint in the platform. Zero browser or proxy overhead.
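As a rough illustration of the rates above, the sketch below estimates a lower bound on credits for a mixed job. It assumes the quoted starting rates apply flat, with 0.1 credits charged per HTML document and 10 credits per PDF page; actual billing may be higher. The function name `estimate_min_credits` is hypothetical and not part of any Spider client.

```python
def estimate_min_credits(html_docs: int, pdf_pages: int) -> float:
    """Lower-bound credit estimate from the starting rates quoted above.

    Assumption: 0.1 credits per HTML document and 10 credits per PDF page,
    applied flat. Real billing may differ.
    """
    if html_docs < 0 or pdf_pages < 0:
        raise ValueError("counts must be non-negative")
    # Work in tenths of a credit so the arithmetic stays exact.
    tenths = html_docs * 1 + pdf_pages * 100
    return tenths / 10


# 250 HTML documents plus a 12-page PDF: 250 * 0.1 + 12 * 10
print(estimate_min_credits(html_docs=250, pdf_pages=12))  # prints 145.0
```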
Batch Processing
Send an array of HTML documents in one request. Process entire collections of saved pages in a single API call, up to 10 MB total.
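Collections larger than the 10 MB limit need to be split across several requests. The following is a minimal sketch of one way to do that. It assumes the limit applies to the UTF-8 size of the HTML and reads "10 MB" conservatively as 10,000,000 bytes; JSON escaping and the `url` fields add overhead on top, so leaving headroom is sensible. The helper name `batch_by_size` is hypothetical.

```python
from typing import Dict, Iterator, List

# Assumption: "10 MB" is read conservatively as 10^7 bytes of HTML.
MAX_BATCH_BYTES = 10 * 1000 * 1000


def batch_by_size(
    docs: List[Dict[str, str]], limit: int = MAX_BATCH_BYTES
) -> Iterator[List[Dict[str, str]]]:
    """Group {html, url} documents into batches whose HTML fits under `limit`."""
    batch: List[Dict[str, str]] = []
    used = 0
    for doc in docs:
        size = len(doc["html"].encode("utf-8"))
        if size > limit:
            raise ValueError(f"document {doc.get('url', '?')} exceeds the batch limit")
        # Start a new batch when adding this document would overflow the current one.
        if batch and used + size > limit:
            yield batch
            batch, used = [], 0
        batch.append(doc)
        used += size
    if batch:
        yield batch


# Tiny demonstration with a 10-byte limit instead of 10 MB.
pages = [
    {"html": "a" * 6, "url": "https://example.com/1"},
    {"html": "b" * 5, "url": "https://example.com/2"},
    {"html": "c" * 3, "url": "https://example.com/3"},
]
print([len(b) for b in batch_by_size(pages, limit=10)])  # prints [1, 2]
```

Each yielded batch can then be sent as one Transform request.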
Three Cleaning Levels
Standard
Basic HTML-to-format conversion. Preserves all content structure including navigation, footers, and sidebars.
No flags needed
AI Clean
Removes navigation, footers, ads, and boilerplate. Keeps main article content, optimized for feeding into language models.
"clean": true
Full Clean
Strips all non-essential HTML attributes: classes, IDs, inline styles. Produces minimal, semantic markup.
"clean_full": true
Key Capabilities
Readability Extraction
Enable readability to extract just the main content using Mozilla's readability algorithm. Perfect for articles and blog posts.
Multiple Output Formats
Convert to markdown, text, or sanitized html. Markdown for LLMs, text for NLP, clean HTML for re-rendering.
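The cleaning level, readability switch, and output format all travel as request options, so it can help to assemble them in one place. The sketch below is one possible way to do that. The option keys `return_format`, `clean`, `clean_full`, and `readability` come from this page; the format values `"text"` and `"html"`, the level names `"standard"`, `"ai"`, and `"full"`, and the helper `build_options` are assumptions made for illustration.

```python
from typing import Dict, Optional, Union

# Assumption: the format names on this page are also the accepted values.
RETURN_FORMATS = {"markdown", "text", "html"}

# Hypothetical level names mapped to the flags shown for each cleaning level.
CLEANING_FLAGS: Dict[str, Optional[str]] = {
    "standard": None,       # no flags needed
    "ai": "clean",          # "clean": true
    "full": "clean_full",   # "clean_full": true
}


def build_options(
    return_format: str = "markdown",
    cleaning: str = "standard",
    readability: bool = False,
) -> Dict[str, Union[str, bool]]:
    """Assemble a Transform options object from the settings described above."""
    if return_format not in RETURN_FORMATS:
        raise ValueError(f"unknown return_format: {return_format!r}")
    if cleaning not in CLEANING_FLAGS:
        raise ValueError(f"unknown cleaning level: {cleaning!r}")
    options: Dict[str, Union[str, bool]] = {"return_format": return_format}
    flag = CLEANING_FLAGS[cleaning]
    if flag is not None:
        options[flag] = True
    if readability:
        options["readability"] = True
    return options


print(build_options("markdown", cleaning="ai"))
# prints {'return_format': 'markdown', 'clean': True}
```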
URL Context
Pass the source URL alongside HTML so relative links resolve to absolute URLs. Ensures links in markdown output work correctly.
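To show what resolving relative links against the source URL means in practice, here is a small illustration using Python's standard `urllib.parse.urljoin`. It demonstrates the general resolution rule only; it is not Transform's own implementation, and the helper name `resolve` is hypothetical.

```python
from urllib.parse import urljoin


def resolve(page_url: str, href: str) -> str:
    """Resolve a link found in a page against that page's URL."""
    return urljoin(page_url, href)


page_url = "https://example.com/blog/post"
for href in ("/pricing", "img/chart.png", "../about", "https://other.example/x"):
    print(f"{href} -> {resolve(page_url, href)}")
# /pricing -> https://example.com/pricing
# img/chart.png -> https://example.com/blog/img/chart.png
# ../about -> https://example.com/about
# https://other.example/x -> https://other.example/x
```

Without a source URL there is no base to resolve against, so a link such as `img/chart.png` cannot be turned into a working absolute URL.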
Batch Input
Send an array of {html, url} objects. Transform dozens of pages in a single request to minimize round-trips.
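For callers not using a client library, a batch request can be assembled by hand. The sketch below builds one with Python's standard library, mirroring the endpoint, headers, and body shape of the curl example on this page. The helper `build_transform_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request
from typing import Iterable, Tuple

API_URL = "https://api.spider.cloud/transform"  # endpoint from the curl example


def build_transform_request(
    pages: Iterable[Tuple[str, str]], api_key: str, **options
) -> urllib.request.Request:
    """Build a POST request carrying a batch of (url, html) pairs."""
    body = {
        "data": [{"html": html, "url": url} for url, html in pages],
        **options,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


saved_pages = [
    ("https://example.com/1", "<h1>Page One</h1><p>Content...</p>"),
    ("https://example.com/2", "<h1>Page Two</h1><p>Content...</p>"),
]
request = build_transform_request(
    saved_pages,
    os.environ.get("SPIDER_API_KEY", ""),
    return_format="markdown",
    readability=True,
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(len(json.loads(request.data)["data"]), "documents in one request")
```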
PDF to Markdown
Send PDF bytes, get structured markdown. Tables, headings, lists, and reading order are preserved. Handles scanned documents with built-in OCR.
10 MB Payload
Process up to 10 MB of HTML per request. Large pages, long articles, and complex documents are handled without truncation.
Code Examples
from spider import Spider

client = Spider()

html_content = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# Convert one in-memory HTML document to markdown with AI Clean enabled.
result = client.transform(
    [{"html": html_content, "url": "https://example.com"}],
    params={
        "return_format": "markdown",
        "clean": True,
    },
)

print(result[0]["content"])
# Output:
# # Hello
# World

curl -X POST https://api.spider.cloud/transform \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      {"html": "<h1>Page One</h1><p>Content...</p>", "url": "https://example.com/1"},
      {"html": "<h1>Page Two</h1><p>Content...</p>", "url": "https://example.com/2"}
    ],
    "return_format": "markdown",
    "readability": true
  }'

import Spider from "@spider-cloud/spider-client";
const client = new Spider();

const result = await client.transform(
  [{ html: "<h1>Hello</h1><p>World</p>", url: "https://example.com" }],
  {
    return_format: "markdown",
    readability: true,
  }
);

console.log(result[0].content);

Popular Use Cases
Post-Processing Cached Content
You've already saved HTML from your own crawlers or a CDN cache. Transform converts it to clean markdown without consuming browser or proxy credits.
Email & Newsletter Parsing
Convert HTML emails into readable text or markdown for indexing, summarization, or feeding into language models.
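Transform takes HTML as input, so the HTML part has to be pulled out of the raw MIME message first. The following is a minimal sketch of that step using Python's standard `email` package. The helper name `extract_html_body` is hypothetical, and the sample message is built locally purely for demonstration.

```python
from email import message_from_string, policy
from email.message import EmailMessage
from typing import Optional


def extract_html_body(raw_email: str) -> Optional[str]:
    """Return the HTML part of a raw email, or None if it has no HTML part."""
    message = message_from_string(raw_email, policy=policy.default)
    body = message.get_body(preferencelist=("html",))
    return body.get_content() if body is not None else None


# Build a small multipart/alternative message to exercise the helper.
demo = EmailMessage()
demo["Subject"] = "Weekly digest"
demo["From"] = "news@example.com"
demo["To"] = "reader@example.com"
demo.set_content("Plain-text fallback")
demo.add_alternative("<h1>Digest</h1><p>Top stories</p>", subtype="html")

html = extract_html_body(demo.as_string())
print(html.strip())  # this string is what would go in the "html" field
```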
CMS Content Migration
Export HTML from one CMS and transform it to markdown for import into a static site generator, wiki, or headless CMS.
PDF Ingestion for RAG Pipelines
Pull structured text from research papers, 10-Ks, contracts, and technical specs. Feed clean markdown directly into your vector store or LLM context window.