In 2026, every AI builder—whether you’re building RAG pipelines, autonomous agents, or custom LLMs—needs clean, structured web data. Raw HTML is noisy, JavaScript-rendered pages break most scrapers, and anti-bot measures are everywhere. Enter FireCrawl: the open-source “Web Data API for AI” that turns entire websites into LLM-ready markdown, structured JSON, screenshots, and more.
With 85k+ GitHub stars and active development (latest release just weeks ago), FireCrawl has become the go-to for turning the internet into tokens. Best of all? You can self-host it for free with a few Docker commands—perfect for privacy-conscious teams, internal tools, or unlimited crawling without burning through API credits.
In this post I’ll walk you through:
- What FireCrawl actually does (and why it’s special)
- Step-by-step self-hosting guide (Docker + .env tweaks)
- How to use it locally (Python/Node SDK examples)
- Honest comparisons with Crawl4AI, AnyCrawl, and Scrapling
Let’s dive in.
What Is FireCrawl?
FireCrawl (originally published under the MendableAI GitHub org) is an API-first crawler/scraper purpose-built for AI applications. Its core promise: give it a URL (or a whole domain) and get back clean, token-efficient markdown or perfectly structured data that your LLM can actually use.
Key capabilities (all work in self-hosted mode except where noted):
- Scrape – Single page → markdown, raw HTML, screenshot, links, or JSON (via Pydantic schema or natural-language prompt)
- Crawl – Entire website with depth limits, URL filters, sitemap respect, and async queuing
- Map – Lightning-fast URL discovery (sitemap on steroids)
- Search – Web search + optional scraping of results (uses SearXNG in self-host)
- Actions – Click, scroll, type, wait, screenshot before extraction (great for dynamic SPAs)
- Media & files – PDF/DOCX/image text extraction
- Change tracking – Monitor pages over time
- Structured extraction – LLM-powered or schema-based JSON (requires OpenAI/Ollama key)
It handles JavaScript rendering via Playwright, proxies, and auth walls, and, according to the project's own benchmarks, produces output that is 80%+ cleaner than most competitors'.
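The Actions capability is easiest to see as a request payload. Here is a minimal sketch of what a scrape body with actions might look like; the action names and fields follow the Firecrawl docs, but the target URL and CSS selector are hypothetical:

```python
import json

# Hypothetical scrape request body exercising the Actions feature:
# wait for the SPA to hydrate, click a "load more" control, scroll,
# then capture a screenshot before extraction.
payload = {
    "url": "https://example.com/spa-page",  # hypothetical target
    "formats": ["markdown", "screenshot"],
    "actions": [
        {"type": "wait", "milliseconds": 2000},
        {"type": "click", "selector": "#load-more"},  # hypothetical selector
        {"type": "scroll", "direction": "down"},
        {"type": "screenshot"},
    ],
}
body = json.dumps(payload)
```

Actions run in order before extraction, which is what makes dynamic SPAs scrapeable at all.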
License note: The server core is AGPL-3.0 (network use triggers source-sharing obligations), SDKs are MIT.
Why Self-Host FireCrawl?
- Zero usage limits or credits
- Full data privacy (nothing leaves your network)
- Customize everything (proxies, local LLMs via Ollama, custom Playwright)
- Cost savings at scale
- Perfect for air-gapped environments or compliance-heavy orgs
Cloud version still wins for production scale (advanced anti-bot “Fire-engine”, managed queues, /agent endpoint), but self-hosted is excellent for dev, staging, or internal tools.
Step-by-Step: Self-Hosting FireCrawl (Docker – Easiest & Recommended)
Prerequisites
- Docker + Docker Compose (v2+)
- ~4–8 GB RAM recommended (Playwright + Redis + Postgres eat resources)
- Git
1. Clone the repo

```bash
git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl
```

2. Create .env

```bash
cp apps/api/.env.example .env
```

Edit .env (minimum viable):

```bash
PORT=3002
HOST=0.0.0.0
USE_DB_AUTHENTICATION=false   # Supabase not supported in self-host

# For structured extraction / LLM features (optional but powerful)
OPENAI_API_KEY=sk-...   # or use Ollama
# OLLAMA_BASE_URL=http://host.docker.internal:11434   # on Mac/Windows

# Queue admin (change this!)
BULL_AUTH_KEY=supersecret123

# Optional but recommended
ALLOW_LOCAL_WEBHOOKS=true
MAX_CPU=0.85
MAX_RAM=0.85
```

3. (Optional) Use the TypeScript Playwright service for better JS handling

Uncomment/change in docker-compose.yml:

```yaml
playwright-service:
  build: apps/playwright-service-ts
```

4. Start it

```bash
docker compose build   # first time only, ~5–10 min
docker compose up -d
```

5. Verify

- API: http://localhost:3002
- Health check: `curl http://localhost:3002/test`
- Queue dashboard: http://localhost:3002/admin/supersecret123/queues
That’s it. You now have a fully functional FireCrawl instance.
Troubleshooting tips
- Containers fail? `docker compose logs -f api`
- Redis/Postgres connection issues? Check volumes and ports.
- High CPU? Lower `MAX_CPU` or give the host more resources.
- Want Kubernetes? There's an example in `/examples/kubernetes`.
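If compose refuses to start cleanly, a quick pre-flight check of your .env catches most typos before you burn time on container logs. A minimal sketch; the required-key set below is my assumption drawn from the settings used in this guide, not an official manifest:

```python
# Minimal .env sanity check before `docker compose up`.
# REQUIRED is an assumption based on the settings used in this guide.
REQUIRED = {"PORT", "HOST", "USE_DB_AUTHENTICATION", "BULL_AUTH_KEY"}

def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def missing_keys(text: str) -> set:
    return REQUIRED - parse_env(text).keys()

sample = "PORT=3002\nHOST=0.0.0.0\nUSE_DB_AUTHENTICATION=false\n"
print(missing_keys(sample))  # BULL_AUTH_KEY is absent from this sample
```

Run it against the real file with `missing_keys(open(".env").read())` before starting the stack.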
Using Your Self-Hosted FireCrawl
Python SDK (recommended)
```bash
pip install firecrawl-py
```

```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel

app = FirecrawlApp(
    api_key="any-key-works-locally",  # ignored when auth disabled
    api_url="http://localhost:3002"   # ← point here
)

# Simple scrape
result = app.scrape_url("https://docs.firecrawl.dev", formats=["markdown"])
print(result.markdown)

# Structured extraction
class Product(BaseModel):
    name: str
    price: float
    features: list[str]

result = app.scrape_url(
    "https://example.com/product",
    formats=[{"type": "json", "schema": Product.model_json_schema()}]
)
print(result.json)  # perfectly typed data
```

Crawl an entire site:

```python
crawl_result = app.crawl_url(
    "https://docs.firecrawl.dev",
    params={
        "limit": 50,
        "scrapeOptions": {"formats": ["markdown"]},
        "maxDepth": 3
    }
)
# Poll or use a webhook for completion
```

Node.js works identically—just change the base URL in the SDK.
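The "poll for completion" step can be sketched generically. In the sketch below, `get_status` is a hypothetical injection point standing in for whatever status call your SDK version exposes, which keeps the loop testable without a running instance:

```python
import time

def wait_for_crawl(get_status, timeout=300.0, interval=2.0):
    """Call get_status() until the job reports a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("crawl did not finish within timeout")

# Usage with a stub in place of a real status call:
states = iter([{"status": "scraping"}, {"status": "completed", "data": []}])
result = wait_for_crawl(lambda: next(states), interval=0.01)
```

For long crawls, a webhook is kinder to your API than tight polling; the loop above is the fallback.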
You can also hit the REST endpoints directly with curl or any HTTP client (same payloads as the cloud docs).
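For the raw-HTTP route, here is what the same scrape call looks like from Python's standard library. The `/v1/scrape` path mirrors the cloud docs (verify it against your deployed version); the request is only constructed here, not sent:

```python
import json
import urllib.request

# Build the scrape call as a plain HTTP POST against the local instance.
payload = {"url": "https://docs.firecrawl.dev", "formats": ["markdown"]}
req = urllib.request.Request(
    "http://localhost:3002/v1/scrape",  # path mirrors the cloud API docs
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer any-key-works-locally",  # ignored when auth is off
    },
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment with the stack running
```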
Pro tip: Combine with LangChain's `FireCrawlLoader` or LlamaIndex for instant RAG.
FireCrawl vs. the Competition (2026 Edition)
Here’s how it stacks up against the three most mentioned alternatives:
| Feature | FireCrawl (self-hosted) | Crawl4AI (61k stars) | AnyCrawl (Node.js) | Scrapling (Python) |
|---|---|---|---|---|
| Primary language | TypeScript (API) | Python library | Node.js/TS | Python framework |
| Ease of full-site crawl | ★★★★★ (one call) | ★★★★ | ★★★★ | ★★★ (needs spider code) |
| LLM-ready markdown | Excellent (cleanest) | Excellent + filters | Very good + OCR | Good (adaptive) |
| Structured extraction | Schema + LLM prompt | CSS/JSON/LLM strategies | LLM JSON | Manual selectors + AI optional |
| JS rendering | Built-in Playwright | Playwright + stealth | Playwright/Puppeteer | Optional stealth fetcher |
| Self-host complexity | Medium (Docker Compose) | Very easy (pip + optional Docker) | Docker Compose or npm | pip install (lightweight) |
| Anti-bot / stealth | Basic + proxies | Good | Good + proxies | ★★★★★ (learns site changes) |
| SERP / search | Built-in (SearXNG) | No | Excellent (Google/Bing/etc.) | No |
| Resource usage | Higher (queues, Postgres, Redis) | Low | Medium | Lowest |
| Community / Stars | 85k+ | 61k | ~2.7k | 16k |
| Best for | Turnkey web→LLM API | Python RAG pipelines | Node high-throughput + SERP | Stealthy custom scrapers |
Quick verdict
- Choose FireCrawl when you want the “just works” experience for entire websites → clean markdown → LLM. The API consistency between cloud and self-host is a huge win.
- Choose Crawl4AI if you live in Python and want maximum customization + lightweight deployment inside your existing codebase.
- Choose AnyCrawl for Node.js backends or heavy SERP work.
- Choose Scrapling when you need to scrape tricky, frequently-changing sites that block everything else (its adaptive parser is magic).
Many teams actually use FireCrawl for the heavy lifting and fall back to Scrapling/Crawl4AI for edge cases.
Pros & Cons of Self-Hosted FireCrawl
Pros
- Unlimited & private
- Full feature parity on core endpoints
- Local LLM support (Ollama)
- Queue monitoring UI
- Easy to add custom proxies or Playwright tweaks
Cons
- Heavier than pure libraries (needs Redis/Postgres/Playwright)
- /agent and advanced cloud-only anti-bot missing
- AGPL obligations if you expose it publicly
- You’re responsible for scaling & maintenance
Final Thoughts
FireCrawl has democratized high-quality web data for AI. Whether you use the generous cloud tier or self-host on a modest VPS (budget for the ~4–8 GB RAM noted above), you'll spend far less time fighting scrapers and more time building actually useful agents.
Ready to try it?
- Star the repo: https://github.com/firecrawl/firecrawl
- Run the Docker setup above (takes <15 min)
- Point your SDK at `http://localhost:3002`
- Watch your RAG accuracy skyrocket
Happy crawling—and may your markdown be forever clean! 🔥