In 2026, every AI builder—whether you’re building RAG pipelines, autonomous agents, or custom LLMs—needs clean, structured web data. Raw HTML is noisy, JavaScript-rendered pages break most scrapers, and anti-bot measures are everywhere. Enter FireCrawl: the open-source “Web Data API for AI” that turns entire websites into LLM-ready markdown, structured JSON, screenshots, and more.

With 85k+ GitHub stars and active development (latest release just weeks ago), FireCrawl has become the go-to for turning the internet into tokens. Best of all? You can self-host it for free with a few Docker commands—perfect for privacy-conscious teams, internal tools, or unlimited crawling without burning through API credits.

In this post I’ll walk you through:

  • What FireCrawl actually does (and why it’s special)
  • Step-by-step self-hosting guide (Docker + .env tweaks)
  • How to use it locally (Python/Node SDK examples)
  • Honest comparisons with Crawl4AI, AnyCrawl, and Scrapling

Let’s dive in.

What Is FireCrawl?

FireCrawl (built by the team behind Mendable AI; the repo formerly lived under the mendableai GitHub org) is an API-first crawler/scraper purpose-built for AI applications. Its core promise: give it a URL (or a whole domain) and get back clean, token-efficient markdown or structured data that your LLM can actually use.

Key capabilities (all work in self-hosted mode except where noted):

  • Scrape – Single page → markdown, raw HTML, screenshot, links, or JSON (via Pydantic schema or natural-language prompt)
  • Crawl – Entire website with depth limits, URL filters, sitemap respect, and async queuing
  • Map – Lightning-fast URL discovery (sitemap on steroids)
  • Search – Web search + optional scraping of results (uses SearXNG in self-host)
  • Actions – Click, scroll, type, wait, screenshot before extraction (great for dynamic SPAs)
  • Media & files – PDF/DOCX/image text extraction
  • Change tracking – Monitor pages over time
  • Structured extraction – LLM-powered or schema-based JSON (requires OpenAI/Ollama key)

It handles JavaScript rendering via Playwright, works through proxies and auth walls, and, per the project's own benchmarks, produces markedly cleaner output than most competitors.
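If you'd rather see the raw API than the SDK, the core scrape is a single POST. A minimal standard-library sketch, assuming your self-hosted instance exposes the same `/v1/scrape` request shape as the cloud docs (the `Authorization` header is ignored when auth is disabled):

```python
import json
import urllib.request

def build_scrape_payload(url: str, formats=("markdown",)) -> dict:
    # Minimal request body mirroring the shape in the cloud docs
    return {"url": url, "formats": list(formats)}

payload = build_scrape_payload("https://example.com")
req = urllib.request.Request(
    "http://localhost:3002/v1/scrape",  # your self-hosted instance
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer any-key",  # ignored with USE_DB_AUTHENTICATION=false
    },
)
# Uncomment with a running instance:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["data"]["markdown"])
```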

License note: The server core is AGPL-3.0 (network use triggers source-sharing obligations), SDKs are MIT.

Why Self-Host FireCrawl?

  • Zero usage limits or credits
  • Full data privacy (nothing leaves your network)
  • Customize everything (proxies, local LLMs via Ollama, custom Playwright)
  • Cost savings at scale
  • Perfect for air-gapped environments or compliance-heavy orgs

Cloud version still wins for production scale (advanced anti-bot “Fire-engine”, managed queues, /agent endpoint), but self-hosted is excellent for dev, staging, or internal tools.

Step-by-Step: Self-Hosting FireCrawl (Docker – Easiest & Recommended)

Prerequisites

  • Docker + Docker Compose (v2+)
  • ~4–8 GB RAM recommended (Playwright + Redis + Postgres eat resources)
  • Git

1. Clone the repo

git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl

2. Create .env

cp apps/api/.env.example .env

Edit .env (minimum viable):

PORT=3002
HOST=0.0.0.0
USE_DB_AUTHENTICATION=false   # Supabase not supported in self-host

# For structured extraction / LLM features (optional but powerful)
OPENAI_API_KEY=sk-...          # or use Ollama
# OLLAMA_BASE_URL=http://host.docker.internal:11434   # on Mac/Windows

# Queue admin (change this!)
BULL_AUTH_KEY=supersecret123

# Optional but recommended
ALLOW_LOCAL_WEBHOOKS=true
MAX_CPU=0.85
MAX_RAM=0.85

3. (Optional) Use TypeScript Playwright service for better JS handling
Uncomment/change in docker-compose.yml:

playwright-service:
  build: apps/playwright-service-ts

4. Start it

docker compose build   # first time only, ~5–10 min
docker compose up -d

5. Verify

  • API: http://localhost:3002
  • Health check: curl http://localhost:3002/test
  • Queue dashboard: http://localhost:3002/admin/supersecret123/queues

That’s it. You now have a fully functional FireCrawl instance.
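If you script against the instance (CI, provisioning, etc.), it helps to wait for the API before sending work. A small readiness helper of my own (not part of FireCrawl) that polls the `/test` health endpoint:

```python
import time
import urllib.error
import urllib.request

def wait_for_firecrawl(base_url: str = "http://localhost:3002",
                       timeout: float = 60.0) -> bool:
    """Poll the /test health endpoint until the API responds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/test", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(2)  # containers may still be starting
    return False
```

Call it once after `docker compose up -d` and bail out if it returns False.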

Troubleshooting tips

  • Containers fail? docker compose logs -f api
  • Redis/Postgres connection issues? Check volumes and ports.
  • High CPU? Lower MAX_CPU or give the host more resources.
  • Want Kubernetes? There’s an example in /examples/kubernetes.

Using Your Self-Hosted FireCrawl

Python SDK (recommended)

pip install firecrawl-py

from firecrawl import FirecrawlApp
from pydantic import BaseModel

app = FirecrawlApp(
    api_key="any-key-works-locally",  # ignored when auth disabled
    api_url="http://localhost:3002"   # ← point here
)

# Simple scrape
result = app.scrape_url("https://docs.firecrawl.dev", formats=["markdown"])
print(result.markdown)

# Structured extraction
class Product(BaseModel):
    name: str
    price: float
    features: list[str]

result = app.scrape_url(
    "https://example.com/product",
    formats=[{"type": "json", "schema": Product.model_json_schema()}]
)
print(result.json)  # dict matching the Product schema
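Since the extraction is LLM-powered, the model can still return something malformed, so it's worth re-validating the payload with the same Pydantic model before trusting it downstream. A sketch with an illustrative payload (not real FireCrawl output):

```python
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: float
    features: list[str]

# Illustrative dict shaped like an extraction result
raw = {"name": "Widget", "price": 19.99, "features": ["fast", "cheap"]}

try:
    product = Product.model_validate(raw)  # raises ValidationError on a bad shape
except ValidationError as exc:
    raise SystemExit(f"Extraction did not match schema: {exc}")

print(product.name, product.price)
```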

Crawl an entire site

crawl_result = app.crawl_url(
    "https://docs.firecrawl.dev",
    params={
        "limit": 50,
        "scrapeOptions": {"formats": ["markdown"]},
        "maxDepth": 3
    }
)
# Poll or use webhook for completion

Node.js works identically—just change the base URL in the SDK.

You can also hit the REST endpoints directly with curl or any HTTP client (same payloads as the cloud docs).
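For crawls over REST, the flow in the cloud docs is: POST `/v1/crawl` returns a job `id`, then you poll GET `/v1/crawl/{id}` until the job reaches a terminal status. A hedged sketch, assuming the self-hosted API mirrors that shape:

```python
import json
import urllib.request

BASE = "http://localhost:3002"

def crawl_status_url(job_id: str, base: str = BASE) -> str:
    # Per the cloud docs, crawl jobs are polled at GET /v1/crawl/{id}
    return f"{base}/v1/crawl/{job_id}"

def is_finished(status_payload: dict) -> bool:
    # Treat anything outside these terminal states as still running
    return status_payload.get("status") in {"completed", "failed", "cancelled"}

# With a running instance (uncomment to use):
# body = json.dumps({"url": "https://docs.firecrawl.dev", "limit": 50}).encode()
# req = urllib.request.Request(f"{BASE}/v1/crawl", data=body,
#                              headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     job_id = json.loads(resp.read())["id"]
# ...then poll crawl_status_url(job_id) until is_finished(...) is True
```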

Pro tip: Combine with LangChain’s FireCrawlLoader or LlamaIndex for instant RAG.

FireCrawl vs. the Competition (2026 Edition)

Here’s how it stacks up against the three most mentioned alternatives:

| Feature | FireCrawl (self-hosted) | Crawl4AI (61k stars) | AnyCrawl (Node.js) | Scrapling (Python) |
|---|---|---|---|---|
| Primary language | TypeScript (API) | Python library | Node.js/TS | Python framework |
| Ease of full-site crawl | ★★★★★ (one call) | ★★★★ | ★★★★ | ★★★ (needs spider code) |
| LLM-ready markdown | Excellent (cleanest) | Excellent + filters | Very good + OCR | Good (adaptive) |
| Structured extraction | Schema + LLM prompt | CSS/JSON/LLM strategies | LLM JSON | Manual selectors + AI optional |
| JS rendering | Built-in Playwright | Playwright + stealth | Playwright/Puppeteer | Optional stealth fetcher |
| Self-host complexity | Medium (Docker Compose) | Very easy (pip + optional Docker) | Docker Compose or npm | pip install (lightweight) |
| Anti-bot / stealth | Basic + proxies | Good | Good + proxies | ★★★★★ (learns site changes) |
| SERP / search | Built-in (SearXNG) | No | Excellent (Google/Bing/etc.) | No |
| Resource usage | Higher (queues, Postgres, Redis) | Low | Medium | Lowest |
| Community / Stars | 85k+ | 61k | ~2.7k | 16k |
| Best for | Turnkey web→LLM API | Python RAG pipelines | Node high-throughput + SERP | Stealthy custom scrapers |

Quick verdict

  • Choose FireCrawl when you want the “just works” experience for entire websites → clean markdown → LLM. The API consistency between cloud and self-host is a huge win.
  • Choose Crawl4AI if you live in Python and want maximum customization + lightweight deployment inside your existing codebase.
  • Choose AnyCrawl for Node.js backends or heavy SERP work.
  • Choose Scrapling when you need to scrape tricky, frequently-changing sites that block everything else (its adaptive parser is magic).

Many teams actually use FireCrawl for the heavy lifting and fall back to Scrapling/Crawl4AI for edge cases.

Pros & Cons of Self-Hosted FireCrawl

Pros

  • Unlimited & private
  • Full feature parity on core endpoints
  • Local LLM support (Ollama)
  • Queue monitoring UI
  • Easy to add custom proxies or Playwright tweaks

Cons

  • Heavier than pure libraries (needs Redis/Postgres/Playwright)
  • /agent and advanced cloud-only anti-bot missing
  • AGPL obligations if you expose it publicly
  • You’re responsible for scaling & maintenance

Final Thoughts

FireCrawl has democratized high-quality web data for AI. Whether you use the generous cloud tier or self-host on a modest VPS, you’ll spend far less time fighting scrapers and more time building actually useful agents.

Ready to try it?

  1. Star the repo: https://github.com/firecrawl/firecrawl
  2. Run the Docker setup above (takes <15 min)
  3. Point your SDK at http://localhost:3002
  4. Watch your RAG accuracy skyrocket

Happy crawling—and may your markdown be forever clean! 🔥
