In 2026, every AI builder—whether you’re building RAG pipelines, autonomous agents, or custom LLMs—needs clean, structured web data. Raw HTML is noisy, JavaScript-rendered pages break most scrapers, and anti-bot measures are everywhere. Enter FireCrawl: the open-source “Web Data API for AI” that turns entire websites into LLM-ready markdown, structured JSON, screenshots, and more.
With 85k+ GitHub stars and active development (latest release just weeks ago), FireCrawl has become the go-to for turning the internet into tokens. Best of all? You can self-host it for free with a few Docker commands—perfect for privacy-conscious teams, internal tools, or unlimited crawling without burning through API credits.
In this post I’ll walk you through:
- What FireCrawl actually does (and why it’s special)
- Step-by-step self-hosting guide (Docker + .env tweaks)
- How to use it locally (Python/Node SDK examples)
- Honest comparisons with Crawl4AI, AnyCrawl, and Scrapling
Let’s dive in.
What Is FireCrawl?
FireCrawl (originally published under the MendableAI GitHub org) is an API-first crawler/scraper purpose-built for AI applications. Its core promise: give it a URL (or a whole domain) and get back clean, token-efficient markdown or perfectly structured data that your LLM can actually use.
Key capabilities (all work in self-hosted mode except where noted):
- Scrape – Single page → markdown, raw HTML, screenshot, links, or JSON (via Pydantic schema or natural-language prompt)
- Crawl – Entire website with depth limits, URL filters, sitemap respect, and async queuing
- Map – Lightning-fast URL discovery (sitemap on steroids)
- Search – Web search + optional scraping of results (uses SearXNG in self-host)
- Actions – Click, scroll, type, wait, screenshot before extraction (great for dynamic SPAs)
- Media & files – PDF/DOCX/image text extraction
- Change tracking – Monitor pages over time
- Structured extraction – LLM-powered or schema-based JSON (requires OpenAI/Ollama key)
It handles JavaScript rendering via Playwright, proxies, and auth walls, and, according to the project's own benchmarks, produces output that is 80%+ cleaner than most competitors'.
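The Actions capability is easiest to see as a request payload. Here is a minimal sketch of what a scrape body with actions might look like; the action names and fields follow the Firecrawl docs, but the target URL and CSS selector are hypothetical:

```python
import json

# Hypothetical scrape request body exercising the Actions feature:
# wait for the SPA to hydrate, click a "load more" control, scroll,
# then capture a screenshot before extraction.
payload = {
    "url": "https://example.com/spa-page",  # hypothetical target
    "formats": ["markdown", "screenshot"],
    "actions": [
        {"type": "wait", "milliseconds": 2000},
        {"type": "click", "selector": "#load-more"},  # hypothetical selector
        {"type": "scroll", "direction": "down"},
        {"type": "screenshot"},
    ],
}
body = json.dumps(payload)
```

Actions run in order before extraction, which is what makes dynamic SPAs scrapeable at all.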
License note: The server core is AGPL-3.0 (network use triggers source-sharing obligations), SDKs are MIT.
Why Self-Host FireCrawl?
- Zero usage limits or credits
- Full data privacy (nothing leaves your network)
- Customize everything (proxies, local LLMs via Ollama, custom Playwright)
- Cost savings at scale
- Perfect for air-gapped environments or compliance-heavy orgs
Cloud version still wins for production scale (advanced anti-bot “Fire-engine”, managed queues, /agent endpoint), but self-hosted is excellent for dev, staging, or internal tools.
Step-by-Step: Self-Hosting FireCrawl (Docker – Easiest & Recommended)
Prerequisites
- Docker + Docker Compose (v2+)
- ~4–8 GB RAM recommended (Playwright + Redis + Postgres eat resources)
- Git
1. Clone the repo

```bash
git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl
```

2. Create .env

```bash
cp apps/api/.env.example .env
```

Edit .env (minimum viable):

```bash
PORT=3002
HOST=0.0.0.0
USE_DB_AUTHENTICATION=false   # Supabase not supported in self-host

# For structured extraction / LLM features (optional but powerful)
OPENAI_API_KEY=sk-...   # or use Ollama
# OLLAMA_BASE_URL=http://host.docker.internal:11434   # on Mac/Windows

# Queue admin (change this!)
BULL_AUTH_KEY=supersecret123

# Optional but recommended
ALLOW_LOCAL_WEBHOOKS=true
MAX_CPU=0.85
MAX_RAM=0.85
```

3. (Optional) Use the TypeScript Playwright service for better JS handling

Uncomment/change in docker-compose.yml:

```yaml
playwright-service:
  build: apps/playwright-service-ts
```

4. Start it

```bash
docker compose build   # first time only, ~5–10 min
docker compose up -d
```

5. Verify

- API: http://localhost:3002
- Health check: `curl http://localhost:3002/test`
- Queue dashboard: http://localhost:3002/admin/supersecret123/queues
That’s it. You now have a fully functional FireCrawl instance.
Troubleshooting tips
- Containers fail? `docker compose logs -f api`
- Redis/Postgres connection issues? Check volumes and ports.
- High CPU? Lower `MAX_CPU` or give the host more resources.
- Want Kubernetes? There's an example in `/examples/kubernetes`.
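If compose refuses to start cleanly, a quick pre-flight check of your .env catches most typos before you burn time on container logs. A minimal sketch; the required-key set below is my assumption drawn from the settings used in this guide, not an official manifest:

```python
# Minimal .env sanity check before `docker compose up`.
# REQUIRED is an assumption based on the settings used in this guide.
REQUIRED = {"PORT", "HOST", "USE_DB_AUTHENTICATION", "BULL_AUTH_KEY"}

def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def missing_keys(text: str) -> set:
    return REQUIRED - parse_env(text).keys()

sample = "PORT=3002\nHOST=0.0.0.0\nUSE_DB_AUTHENTICATION=false\n"
print(missing_keys(sample))  # BULL_AUTH_KEY is absent from this sample
```

Run it against the real file with `missing_keys(open(".env").read())` before starting the stack.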
Using Your Self-Hosted FireCrawl
Python SDK (recommended)
```bash
pip install firecrawl-py
```

```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel

app = FirecrawlApp(
    api_key="any-key-works-locally",  # ignored when auth disabled
    api_url="http://localhost:3002"   # ← point here
)

# Simple scrape
result = app.scrape_url("https://docs.firecrawl.dev", formats=["markdown"])
print(result.markdown)

# Structured extraction
class Product(BaseModel):
    name: str
    price: float
    features: list[str]

result = app.scrape_url(
    "https://example.com/product",
    formats=[{"type": "json", "schema": Product.model_json_schema()}]
)
print(result.json)  # perfectly typed data
```

Crawl an entire site:

```python
crawl_result = app.crawl_url(
    "https://docs.firecrawl.dev",
    params={
        "limit": 50,
        "scrapeOptions": {"formats": ["markdown"]},
        "maxDepth": 3
    }
)
# Poll or use a webhook for completion
```

Node.js works identically—just change the base URL in the SDK.
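The "poll for completion" step can be sketched generically. In the sketch below, `get_status` is a hypothetical injection point standing in for whatever status call your SDK version exposes, which keeps the loop testable without a running instance:

```python
import time

def wait_for_crawl(get_status, timeout=300.0, interval=2.0):
    """Call get_status() until the job reports a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("crawl did not finish within timeout")

# Usage with a stub in place of a real status call:
states = iter([{"status": "scraping"}, {"status": "completed", "data": []}])
result = wait_for_crawl(lambda: next(states), interval=0.01)
```

For long crawls, a webhook is kinder to your API than tight polling; the loop above is the fallback.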
You can also hit the REST endpoints directly with curl or any HTTP client (same payloads as the cloud docs).
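For the raw-HTTP route, here is what the same scrape call looks like from Python's standard library. The `/v1/scrape` path mirrors the cloud docs (verify it against your deployed version); the request is only constructed here, not sent:

```python
import json
import urllib.request

# Build the scrape call as a plain HTTP POST against the local instance.
payload = {"url": "https://docs.firecrawl.dev", "formats": ["markdown"]}
req = urllib.request.Request(
    "http://localhost:3002/v1/scrape",  # path mirrors the cloud API docs
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer any-key-works-locally",  # ignored when auth is off
    },
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment with the stack running
```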
Pro tip: Combine with LangChain's `FireCrawlLoader` or LlamaIndex for instant RAG.
FireCrawl vs. the Competition (2026 Edition)
Here’s how it stacks up against the three most mentioned alternatives:
| Feature | FireCrawl (self-hosted) | Crawl4AI (61k stars) | AnyCrawl (Node.js) | Scrapling (Python) |
|---|---|---|---|---|
| Primary language | TypeScript (API) | Python library | Node.js/TS | Python framework |
| Ease of full-site crawl | ★★★★★ (one call) | ★★★★ | ★★★★ | ★★★ (needs spider code) |
| LLM-ready markdown | Excellent (cleanest) | Excellent + filters | Very good + OCR | Good (adaptive) |
| Structured extraction | Schema + LLM prompt | CSS/JSON/LLM strategies | LLM JSON | Manual selectors + AI optional |
| JS rendering | Built-in Playwright | Playwright + stealth | Playwright/Puppeteer | Optional stealth fetcher |
| Self-host complexity | Medium (Docker Compose) | Very easy (pip + optional Docker) | Docker Compose or npm | pip install (lightweight) |
| Anti-bot / stealth | Basic + proxies | Good | Good + proxies | ★★★★★ (learns site changes) |
| SERP / search | Built-in (SearXNG) | No | Excellent (Google/Bing/etc.) | No |
| Resource usage | Higher (queues, Postgres, Redis) | Low | Medium | Lowest |
| Community / Stars | 85k+ | 61k | ~2.7k | 16k |
| Best for | Turnkey web→LLM API | Python RAG pipelines | Node high-throughput + SERP | Stealthy custom scrapers |
Quick verdict
- Choose FireCrawl when you want the “just works” experience for entire websites → clean markdown → LLM. The API consistency between cloud and self-host is a huge win.
- Choose Crawl4AI if you live in Python and want maximum customization + lightweight deployment inside your existing codebase.
- Choose AnyCrawl for Node.js backends or heavy SERP work.
- Choose Scrapling when you need to scrape tricky, frequently-changing sites that block everything else (its adaptive parser is magic).
Many teams actually use FireCrawl for the heavy lifting and fall back to Scrapling/Crawl4AI for edge cases.
Pros & Cons of Self-Hosted FireCrawl
Pros
- Unlimited & private
- Full feature parity on core endpoints
- Local LLM support (Ollama)
- Queue monitoring UI
- Easy to add custom proxies or Playwright tweaks
Cons
- Heavier than pure libraries (needs Redis/Postgres/Playwright)
- /agent and advanced cloud-only anti-bot missing
- AGPL obligations if you expose it publicly
- You’re responsible for scaling & maintenance
Final Thoughts
FireCrawl has democratized high-quality web data for AI. Whether you use the generous cloud tier or self-host on a modest VPS (budget for the ~4–8 GB RAM noted above), you'll spend far less time fighting scrapers and more time building actually useful agents.
Ready to try it?
- Star the repo: https://github.com/firecrawl/firecrawl
- Run the Docker setup above (takes <15 min)
- Point your SDK at `http://localhost:3002`
- Watch your RAG accuracy skyrocket
Happy crawling—and may your markdown be forever clean! 🔥