PromptHub

Stop Overpaying for Web Scraping! Markdowner Converts Any Site to LLM-Ready Data for Free

Bright Coding

Author


What if I told you that developers are burning hundreds of dollars monthly on web scraping APIs—when they could be getting the same results for free?

Here's the dirty secret of the AI development world: every LLM-powered application needs clean, structured data from the web. Whether you're building a RAG pipeline, training models, or creating AI agents that browse the internet, you need to convert messy HTML into something your models can actually understand. And until now, your options were basically: pay through the nose for proprietary APIs like Firecrawl or Jina AI, or wrestle with brittle open-source solutions that break faster than you can debug them.

But what if there was a third path? One that gives you browser-based rendering, automatic crawling, LLM-powered content filtering, and subpage discovery—all running on Cloudflare's global edge network?

Meet Markdowner—the open-source project that's making expensive web scraping APIs obsolete. Built by Dhravya Shah for his AI app Supermemory, this tool is already turning heads in the developer community. Even Nexxel, a respected voice in the space, couldn't help but share his excitement about what Markdowner delivers.

In this deep dive, I'll show you exactly why top developers are quietly switching to Markdowner, how to get it running in under 10 minutes, and why its architecture might be the smartest deployment of Cloudflare Workers you've seen this year. Let's tear this thing apart.


What Is Markdowner? The Open-Source Weapon Against Expensive Scraping APIs

Markdowner is a fast, self-hostable tool that converts any website into clean, LLM-ready markdown data. Born from the real-world needs of Supermemory—an AI application where users store website content and query it with natural language—this isn't theoretical software built in a vacuum. It's battle-tested infrastructure that solved a genuine pain point.

The creator, Dhravya Shah, noticed something critical while building Supermemory: when data is structured and predictable in markdown format, LLM responses improve dramatically. Unstructured HTML with navigation menus, ads, footers, and random div soup? Your model gets confused, hallucinates, or wastes precious context tokens on irrelevant noise. Clean markdown? Your AI suddenly understands what matters.

Existing solutions fell into predictable traps. Jina AI and Firecrawl are powerful, sure, but they're either proprietary black boxes with usage limits and escalating costs, or they require complex deployments that devour engineering time. For indie hackers, bootstrapped startups, and teams watching their cloud bills, this friction is unacceptable.

Markdowner sidesteps both problems by leveraging Cloudflare's edge infrastructure, specifically Browser Rendering and Durable Objects, to spin up real browser instances, execute JavaScript-heavy pages, and convert the rendered DOM into clean markdown using Turndown. The result? A tool that's open source, cheap to run on a Workers Paid plan, scalable within Cloudflare's platform limits, and globally distributed.

And here's what makes this genuinely exciting: the entire architecture runs on Cloudflare's network, so requests enter at a data center close to the caller rather than in a single distant cloud region. There are no servers to provision and no Kubernetes clusters to babysit, though launching a browser session still adds some startup time to uncached requests.


The Feature Set That Makes Proprietary Tools Nervous

Let's dissect what Markdowner actually delivers, because the feature list punches well above its weight class:

Universal Website Conversion

Markdowner doesn't just fetch HTML and strip tags. It uses real browser rendering through Cloudflare's infrastructure, meaning it handles JavaScript frameworks, SPAs, dynamically loaded content, and pages that would break simple HTTP-fetch tools. React? Vue? Next.js with server components? It just works.

LLM-Powered Content Filtering

This is where it gets clever. Enable llmFilter and Markdowner uses an LLM to intelligently strip navigation elements, ads, cookie banners, and other noise before returning your markdown. You're not just getting any markdown—you're getting relevant markdown that preserves semantic structure without the cruft.

Detailed Response Mode

Need the full picture? enableDetailedResponse includes complete HTML content alongside the markdown conversion. Perfect for debugging, archival, or hybrid pipelines where you need both structured and raw data.

Auto-Crawling Without Sitemaps

Here's a feature that expensive competitors charge premium rates for: crawlSubpages automatically discovers and converts up to 10 linked subpages. No XML sitemap required. No manual URL list. Just point it at a root domain and watch it intelligently traverse internal links.

Flexible Response Formats

Your pipeline needs JSON? Set Content-Type: application/json. Prefer plain text for direct LLM injection? Use Content-Type: text/plain. Markdowner adapts to your architecture, not the other way around.

Zero-Cost Self-Hosting

The entire stack deploys to Cloudflare Workers. If you're already on the paid plan (required for Browser Rendering and Durable Objects), your marginal cost is limited to Browser Rendering usage, which is small at modest volumes. Compare that to per-page pricing that scales punishingly at volume.


Real-World Scenarios Where Markdowner Dominates

Still wondering if this fits your stack? Here are four concrete scenarios where Markdowner isn't just convenient—it's transformative:

1. RAG Pipeline Data Ingestion

You're building a retrieval-augmented generation system that answers questions from documentation, competitor websites, or research sources. Traditional scraping gives you garbage HTML that your chunking strategy struggles with. Markdowner's clean output means your embeddings capture actual semantic meaning, your retrieval accuracy jumps, and your LLM generates better answers with fewer hallucinations.

2. AI Agent Web Browsing

Your autonomous agent needs to navigate websites, extract information, and make decisions. Feeding it raw HTML is like asking someone to read a newspaper through frosted glass. Markdowner's structured output—complete with preserved headings, lists, and tables—gives your agent readable, actionable context that improves reasoning and reduces error rates.

3. Competitive Intelligence at Scale

Monitor competitor pricing, feature announcements, or documentation changes without building a custom scraping framework. Deploy a single Markdowner Worker (Cloudflare serves it globally, so there is nothing to replicate per region), schedule periodic crawls with crawlSubpages, and diff the results in your analysis pipeline. With llmFilter enabled, navigation and other boilerplate are stripped first, so the diffs surface substantive content changes rather than cosmetic tweaks.

4. Content Migration and Archival

Migrating from an old CMS? Archiving a dying platform? Markdowner's detailed mode captures both clean markdown and original HTML, giving you a durable archive. The automatic subpage crawling picks up linked pages you might have missed in a manual audit, up to the 10-page limit per request; pages with no inbound links still need to be listed explicitly.


Step-by-Step: From Zero to Self-Hosted in 10 Minutes

Ready to stop reading and start deploying? Here's the complete setup process:

Prerequisites

  • Cloudflare account with Workers Paid Plan (required for Browser Rendering and Durable Objects)
  • Node.js 18+ installed locally
  • npx available in your path

Step 1: Clone and Install

# Clone the repository
git clone https://github.com/supermemoryai/markdowner
cd markdowner

# Install dependencies
npm install

Step 2: Create KV Namespace for Caching

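Markdowner uses Cloudflare KV to cache results and avoid redundant browser renders. Create the namespace with the command below (recent Wrangler releases spell it with a space instead of a colon: npx wrangler kv namespace create md_cache):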

npx wrangler kv:namespace create md_cache

This outputs something like:

✨ Success!
Add the following to your configuration file:
[[kv_namespaces]]
binding = "md_cache"
id = "your-generated-namespace-id"

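Step 3: Configure wrangler.toml

Open wrangler.toml in your editor and update the KV namespace ID. Leave the other bindings in the repository's file, such as those for Browser Rendering and Durable Objects, unchanged; the excerpt here shows only the relevant part: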

name = "markdowner"
main = "src/index.ts"
compatibility_date = "2024-01-01"

# Add the KV namespace from Step 2
[[kv_namespaces]]
binding = "md_cache"
id = "your-actual-namespace-id-here"

Step 4: Deploy to Cloudflare

npm run deploy

That's it. Your instance is live on a *.workers.dev subdomain. Custom domain binding takes two more clicks in the Cloudflare dashboard.

Environment Verification

Test your deployment:

curl 'https://your-worker.your-subdomain.workers.dev/?url=https://example.com' \
  -H 'Content-Type: text/plain'

Code Deep-Dive: Real Examples from the Repository

Let's examine actual usage patterns, starting with the simplest possible call and progressing to advanced configurations.

Basic Conversion: The One-Liner

The foundation of everything—convert any URL to markdown with a single GET request:

# Minimal viable usage: just the URL parameter
curl 'https://md.dhr.wtf/?url=https://example.com'

What's happening under the hood? Cloudflare's Browser Rendering launches a headless Chromium instance on Cloudflare's network. The page fully renders, including any JavaScript execution, and then Turndown converts the resulting DOM to markdown. The result comes back as plain text by default.

JSON Response with LLM Filtering

For production pipelines, you'll want structured responses and intelligent content filtering:

# Advanced request with all optional parameters
curl 'https://md.dhr.wtf/?url=https://news.ycombinator.com&llmFilter=true&enableDetailedResponse=true' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json'

Parameter breakdown:

  • llmFilter=true: Invokes an LLM to strip navigation, ads, and irrelevant elements before the markdown is returned. This adds processing time but can noticeably improve output quality for content-heavy pages.
  • enableDetailedResponse=true: Returns both markdown and html fields in the JSON response, useful for verification or fallback processing.
  • Content-Type: application/json: Explicitly requests JSON formatting; without this, you'll receive plain text.
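If you call the API from code rather than curl, it helps to build the query string in one place. The helper below is a minimal sketch, not part of the Markdowner repository: the function and interface names are hypothetical, and only the query parameter names come from the list above.

```typescript
// Hypothetical helper: builds a Markdowner request URL from typed options.
// The parameter names (url, llmFilter, enableDetailedResponse, crawlSubpages)
// come from the article; everything else is illustrative.
interface MarkdownerOptions {
  llmFilter?: boolean;
  enableDetailedResponse?: boolean;
  crawlSubpages?: boolean;
}

function buildMarkdownerUrl(
  baseUrl: string,
  targetUrl: string,
  options: MarkdownerOptions = {},
): string {
  const endpoint = new URL(baseUrl);
  // searchParams.set percent-encodes the target URL, so its own query
  // string cannot be confused with Markdowner's parameters.
  endpoint.searchParams.set("url", targetUrl);
  for (const [name, enabled] of Object.entries(options)) {
    // Only send flags that are switched on; omitted flags use the defaults.
    if (enabled) endpoint.searchParams.set(name, "true");
  }
  return endpoint.toString();
}

const requestUrl = buildMarkdownerUrl(
  "https://md.dhr.wtf/",
  "https://example.com/search?q=a&b=c",
  { llmFilter: true },
);
console.log(requestUrl);
```

Pass the resulting string to fetch along with the Content-Type header for the response format you want.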

Illustrative response structure (field names may differ between versions, so verify against your own deployment):

{
  "url": "https://news.ycombinator.com",
  "markdown": "# Hacker News\n\n## 1. Show HN: Markdowner...",
  "html": "<!DOCTYPE html>...",
  "filtered": true,
  "pages_crawled": 1
}

Multi-Page Crawling for Documentation Sites

The killer feature for documentation ingestion—automatic subpage discovery:

# Crawl up to 10 linked subpages automatically
curl 'https://md.dhr.wtf/?url=https://docs.python.org/3/&crawlSubpages=true' \
  -H 'Content-Type: application/json'

Critical implementation note: Unlike sitemap-based crawlers, Markdowner discovers subpages from the links on the rendered page. This means it works on sites without sitemaps, but it only follows internal links on the same site and ignores external ones. The 10-page limit prevents runaway crawling; for larger sites, split the site into sections and issue one request per section from your calling code.
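To make the idea concrete, here is a rough sketch of how internal-link selection with a page cap could work. It is an assumption about the general technique, not the repository's actual crawl logic, and the function name selectSubpages is hypothetical.

```typescript
// Illustrative only: not the repository's actual crawl logic.
// Given the links found on a rendered page, keep unique same-origin URLs
// up to a fixed limit.
function selectSubpages(rootUrl: string, hrefs: string[], limit = 10): string[] {
  const root = new URL(rootUrl);
  const seen = new Set<string>([root.href]); // never re-crawl the root page
  const selected: string[] = [];

  for (const href of hrefs) {
    if (selected.length >= limit) break;
    let candidate: URL;
    try {
      candidate = new URL(href, root); // resolves relative links
    } catch {
      continue; // skip malformed hrefs
    }
    candidate.hash = ""; // treat #fragments as the same page
    if (candidate.origin !== root.origin) continue; // internal links only
    if (seen.has(candidate.href)) continue; // drop duplicates
    seen.add(candidate.href);
    selected.push(candidate.href);
  }
  return selected;
}

const links = selectSubpages("https://docs.python.org/3/", [
  "tutorial/",
  "/3/library/",
  "https://www.python.org/psf/",
  "tutorial/#top",
  "mailto:docs@python.org",
]);
console.log(links);
```

In this example only the tutorial and library pages survive: the external link, the duplicate with a fragment, and the mailto link are all dropped.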

Illustrative response for crawled pages (verify the field names against your own deployment):

{
  "root_page": {
    "url": "https://docs.python.org/3/",
    "markdown": "# Python 3.12.1 documentation\n\n..."
  },
  "subpages": [
    {
      "url": "https://docs.python.org/3/tutorial/",
      "markdown": "# The Python Tutorial\n\n..."
    },
    {
      "url": "https://docs.python.org/3/library/",
      "markdown": "# The Python Standard Library\n\n..."
    }
  ],
  "total_pages": 3
}

Self-Hosted Instance with Custom Domain

Once deployed, your self-hosted instance works identically:

# Replace with your actual worker domain
BASE_URL="https://markdowner.your-domain.workers.dev"

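# Batch process multiple URLs, one per line in urls.txt
while IFS= read -r url; do
  # Pass the URL as an argument so quotes inside it cannot break the Python snippet;
  # safe="" also encodes "/" so the whole URL becomes a single query value
  encoded_url=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$url")
  curl -s "${BASE_URL}/?url=${encoded_url}&llmFilter=true" \
    -H 'Content-Type: text/plain' \
    >> output.md
done < urls.txt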

This pattern demonstrates batch processing with URL encoding—critical for URLs containing query parameters or special characters. The plain text output appends cleanly to a single markdown file for downstream processing.


Advanced Usage & Pro Tips

After running Markdowner in production, here are optimization strategies that aren't obvious from the README:

Cache Warming for Critical Paths

Since Markdowner uses KV caching, pre-warm your cache for frequently accessed URLs. A simple cron trigger on your Worker can refresh stale content before users request it:

// Add to your Worker triggers in Cloudflare dashboard
// Runs every hour to refresh top 100 cached URLs
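The warming logic itself can be small. The sketch below is hypothetical: warmCache, the Fetcher type, and the way it is wired into a Worker are my assumptions, not code from the repository. Note that re-requesting a URL only repopulates entries that are missing or expired; overwriting live entries would need a change in the Worker itself.

```typescript
// Hypothetical sketch: the names and the Worker wiring are assumptions.
// The fetcher is injected so the logic can be exercised without a network.
type Fetcher = (url: string) => Promise<{ ok: boolean }>;

async function warmCache(
  baseUrl: string,
  targets: string[],
  fetcher: Fetcher,
): Promise<{ refreshed: string[]; failed: string[] }> {
  const refreshed: string[] = [];
  const failed: string[] = [];

  // Requests run one at a time to keep Browser Rendering usage predictable.
  for (const target of targets) {
    const endpoint = new URL(baseUrl);
    endpoint.searchParams.set("url", target);
    try {
      const response = await fetcher(endpoint.toString());
      (response.ok ? refreshed : failed).push(target);
    } catch {
      failed.push(target); // network error: record it and keep going
    }
  }
  return { refreshed, failed };
}

// In a Worker, an hourly cron trigger (crons = ["0 * * * *"] under
// [triggers] in wrangler.toml) could invoke it from the scheduled handler:
//   async scheduled(event, env, ctx) {
//     ctx.waitUntil(warmCache(BASE_URL, TOP_URLS, (u) => fetch(u)));
//   }
```

BASE_URL and TOP_URLS in the comment are placeholders; where the list of hot URLs comes from is up to your deployment.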

Selective Filtering with Custom LLM Prompts

The open-source nature means you can modify the LLM filtering prompt in src/index.ts. Want to preserve code blocks but strip all images? Prefer table conversions over list representations? Adjust the system prompt to match your downstream pipeline's preferences.

Rate Limiting and Cost Control

Browser Rendering has per-request costs on Cloudflare's paid plan. Implement custom rate limiting in your Worker by adding a simple counter in Durable Objects—something the base repository leaves as an exercise, but production deployments absolutely need.
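As a starting point, a fixed-window counter is about the simplest limiter that works. The class below is an illustrative sketch with hypothetical names: it keeps counts in memory, whereas inside a Durable Object you would persist them in the object's storage so they survive eviction.

```typescript
// Illustrative fixed-window rate limiter. In a Durable Object the counts
// would live in the object's storage instead of this in-memory Map.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private maxRequests: number;
  private windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if the caller identified by `key` may make another request.
  // `now` is injectable so the behavior can be checked deterministically.
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || now - current.start >= this.windowMs) {
      // No window yet, or the old one has elapsed: start a fresh window.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.maxRequests) return false;
    current.count += 1;
    return true;
  }
}

const limiter = new FixedWindowLimiter(2, 60000); // 2 requests per minute
console.log(limiter.allow("client-a", 0));     // first request in the window
console.log(limiter.allow("client-a", 1000));  // second request
console.log(limiter.allow("client-a", 2000));  // rejected: over the limit
console.log(limiter.allow("client-a", 60000)); // accepted: new window
```

Fixed windows allow short bursts at window boundaries; if that matters for your costs, a sliding window or token bucket is the usual next step.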

Hybrid Architecture: Markdowner + Vector DB

For RAG systems, don't store raw markdown. Pipe Markdowner output directly into an embedding pipeline:

Website → Markdowner → Chunking → Embeddings → Pinecone/Weaviate → LLM Query

This eliminates the "garbage in, garbage out" problem that plagues naive RAG implementations.
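The chunking step is where clean markdown pays off, because headings give you natural boundaries. Here is a minimal sketch of heading-based chunking; the function name and the 1000-character default are arbitrary choices of mine, and it does not special-case "#" lines inside code blocks, which a production splitter should.

```typescript
// Illustrative chunker: splits markdown at headings and caps chunk size.
// Limitation: a line starting with "#" inside a code block is treated as a heading.
interface Chunk {
  heading: string;
  text: string;
}

function chunkMarkdownByHeading(markdown: string, maxChars = 1000): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];

  // Emits the buffered section under the current heading, splitting
  // oversized sections into fixed-size pieces. Empty sections are skipped.
  const flush = () => {
    const text = buffer.join("\n").trim();
    buffer = [];
    for (let i = 0; i < text.length; i += maxChars) {
      chunks.push({ heading, text: text.slice(i, i + maxChars) });
    }
  };

  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section before switching headings
      heading = match[1].trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}

const sample = "# Title\n\nIntro text.\n\n## Setup\n\nRun the installer.";
console.log(chunkMarkdownByHeading(sample));
```

Keeping the heading alongside each chunk lets you prepend it to the text before embedding, which tends to help retrieval on short sections.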


Markdowner vs. The Competition: No Contest?

| Feature | Markdowner | Jina AI | Firecrawl | DIY Puppeteer |
|---|---|---|---|---|
| Cost | Free (self-hosted) | Per-request pricing | Tiered SaaS | Infrastructure + maintenance |
| Browser Rendering | ✅ Native (Cloudflare) | ✅ Yes | ✅ Yes | ✅ Manual setup |
| LLM Filtering | ✅ Built-in | ❌ No | ✅ Yes | ❌ Manual integration |
| Auto-Crawling | ✅ Up to 10 pages | ❌ Single URL | ✅ Yes (paid tiers) | ❌ Custom logic needed |
| Self-Hostable | ✅ Fully open | ⚠️ Open source, but complex to self-host | ⚠️ Open source, but complex to self-host | ✅ Yes |
| Edge Deployment | ✅ Global (250+ cities) | ❌ Centralized | ❌ Centralized | ❌ Your problem |
| Setup Complexity | 10 minutes | API key only | API key only | Hours to days |
| Response Formats | Text + JSON | Text + JSON | JSON + Markdown | Whatever you build |

The verdict? If you need a quick API call and don't mind per-request costs, Jina AI and Firecrawl are fine. But if you're processing volume, care about data sovereignty, want customization, or simply refuse to pay recurring SaaS fees for infrastructure you can own—Markdowner is the clear winner.


FAQ: What Developers Actually Ask

Is Markdowner really free?

The code is 100% free and open-source under the repository license. Running it requires Cloudflare's Workers Paid Plan ($5/month minimum) for Browser Rendering and Durable Objects access. For most use cases, this is dramatically cheaper than per-request alternatives at scale.

Can I use the public API (md.dhr.wtf) commercially?

The public endpoint is provided as a convenience but comes with no SLA or rate guarantee. For production workloads, self-hosting is strongly recommended—and trivial to set up.

How does it handle JavaScript-heavy sites like SPAs?

Cloudflare Browser Rendering executes full Chromium instances, waiting for network idle before conversion. React, Vue, Angular, and modern frameworks render correctly. Extremely lazy-loaded content may require parameter tuning.

What's the difference between llmFilter and standard conversion?

Standard conversion applies Turndown's heuristic rules. llmFilter adds an LLM pass that understands semantic importance—preserving article content while removing navigation, ads, and boilerplate with much higher accuracy.

Can I increase the 10-page crawl limit?

The limit is hardcoded but easily modified in the source. Just be mindful of Cloudflare's Browser Rendering usage quotas and implement appropriate rate limiting.

Does it support authentication-required pages?

Not directly in the current release. For authenticated content, you'd need to extend the Worker to accept and inject cookies or session tokens into the browser context.

How does caching work?

The KV namespace stores rendered markdown keyed by URL. Identical requests bypass browser rendering entirely, returning cached results in milliseconds. Cache invalidation is manual or time-based.
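The read-through pattern described here fits in a few lines. The sketch below is an assumption about how such a cache can be structured, not the repository's code: getOrRender and KVLike are hypothetical names, and KVLike declares only the subset of the Workers KV get and put methods that the example needs. The expirationTtl option is what makes invalidation time-based.

```typescript
// Minimal subset of the Workers KV interface, declared locally so the
// sketch is self-contained. A real KV binding satisfies this shape.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Hypothetical read-through cache: return the stored markdown if present,
// otherwise render the page, store the result with a TTL, and return it.
async function getOrRender(
  cache: KVLike,
  url: string,
  render: (url: string) => Promise<string>,
  ttlSeconds = 3600,
): Promise<{ markdown: string; cached: boolean }> {
  const hit = await cache.get(url);
  if (hit !== null) return { markdown: hit, cached: true };

  const markdown = await render(url); // the expensive browser render
  await cache.put(url, markdown, { expirationTtl: ttlSeconds });
  return { markdown, cached: false };
}
```

Two things to watch if you adapt this: KV keys have a size limit, so very long URLs may need hashing first, and request flags such as llmFilter should probably be part of the key so filtered and unfiltered results do not collide.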


The Bottom Line: Own Your Data Pipeline

Here's what it comes down to: the companies building the AI future are increasingly unwilling to depend on black-box APIs for core infrastructure. Markdowner represents a broader shift toward composable, self-hostable tools that give developers control without sacrificing capability.

Is it perfect? No. Authentication support could be richer. The crawl limit is conservative. You'll need to monitor your Cloudflare usage. But for most web-to-markdown use cases, it delivers production-grade functionality at indie-hacker pricing: little ongoing cost beyond your existing Cloudflare plan and its Browser Rendering usage.

The architecture is genuinely clever: Durable Objects for stateful browser management, KV for intelligent caching, edge deployment for global performance. This isn't a toy project. It's production infrastructure that happens to be free.

My recommendation? Star the repository, deploy it this afternoon, and run your next scraping job through it. Compare the output quality against whatever you're paying for now. I suspect you'll be canceling a subscription before the week ends.

⭐ Star Markdowner on GitHub — and if it saves you money, consider supporting the creator's work on Supermemory. The best open-source tools deserve thriving ecosystems around them.


What's your current web-to-LLM pipeline? Drop your setup in the comments—I'm curious whether Markdowner could replace something you're paying for.
