What is focused crawling?
Focused crawling limits a crawler to a specific topic, domain set, or content category rather than following all discoverable links. Where a general crawler builds broad coverage, a focused crawler prioritizes relevance: it either filters URLs by pattern before fetching, or evaluates page content after fetching and discards pages that fall outside the target subject. The result is a smaller dataset with a higher signal-to-noise ratio, at lower bandwidth and compute cost than crawling everything.
| Factor | General crawling | Focused crawling |
|---|---|---|
| Scope | All reachable URLs | Topic or domain subset |
| Dataset size | Large, mixed relevance | Smaller, high relevance |
| Filtering approach | Minimal | URL patterns or content classifiers |
| Crawl strategy | Typically breadth-first | Depth-first for deep topic clusters |
| Primary use case | Search indexes, web archives | LLM training data, domain research |
Focused crawling makes the most sense when you need high-quality, on-topic content: training a domain-specific model on technical documentation, collecting product data from a curated list of retailers, or building a research corpus from a specific publication type. The core design decision is when to filter. Filtering by URL pattern before the fetch is fast but shallow, since relevant pages may live at unpredictable paths. Filtering by content classifier after the fetch is more accurate, but every discarded page still costs a request. Most production focused crawlers combine both: tight crawl scope rules as a first pass, followed by content-level filtering on what gets kept. A depth-first strategy works better than breadth-first when relevant content clusters deep in a site's structure, because it reaches those pages sooner under a fixed page budget.
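As a rough illustration of the combined approach, the Python sketch below applies a URL-pattern check before fetching and a keyword check after fetching, using a stack so the crawl runs depth-first. Everything in it is an assumption made for the example: the `SITE` dictionary stands in for real HTTP fetches, and the `example.com` URLs, the keyword set, and the `min_hits` threshold are hypothetical. A production crawler would likely swap the keyword check for a trained classifier.

```python
import re

# Hypothetical in-memory site standing in for real HTTP fetches:
# URL -> (page text, outgoing links).
SITE = {
    "https://example.com/docs/": (
        "API reference index for the crawler SDK.",
        [
            "https://example.com/docs/auth",
            "https://example.com/docs/changelog",
            "https://example.com/blog/",
        ],
    ),
    "https://example.com/docs/auth": (
        "Authenticate each API request with a token in the header.",
        [],
    ),
    "https://example.com/docs/changelog": (
        "Office party photos and team news.",
        [],
    ),
    "https://example.com/blog/": ("Company news.", []),
}

# Stage 1 (before the fetch): keep only URLs under /docs/.
ALLOWED = re.compile(r"^https://example\.com/docs/")

# Stage 2 (after the fetch): a toy keyword-overlap relevance check.
KEYWORDS = {"api", "token", "request", "sdk"}


def url_in_scope(url: str) -> bool:
    return ALLOWED.match(url) is not None


def is_relevant(text: str, min_hits: int = 2) -> bool:
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & KEYWORDS) >= min_hits


def fetch_stub(url: str):
    """Return (text, links) for a URL from the in-memory site."""
    return SITE[url]


def focused_crawl(seed: str, fetch):
    """Depth-first crawl; returns (kept URLs, number of fetches made)."""
    stack = [seed]  # LIFO frontier gives depth-first order
    seen = {seed}
    kept = []
    fetched = 0
    while stack:
        url = stack.pop()
        text, links = fetch(url)
        fetched += 1
        if not is_relevant(text):
            continue  # discarded, but the request was still spent
        kept.append(url)
        for link in links:
            # Out-of-scope links are dropped without any request.
            if link not in seen and url_in_scope(link):
                seen.add(link)
                stack.append(link)
    return kept, fetched


kept, fetched = focused_crawl("https://example.com/docs/", fetch_stub)
print(kept, fetched)
```

In this toy run the `/blog/` link is rejected by the URL pattern without a request, while `/docs/changelog` passes the pattern, is fetched, and is then discarded by the content check. That discarded fetch is the extra cost of filtering after the fetch. The sketch also does not follow links from discarded pages, which is a design choice and not a requirement.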
Firecrawl's Crawl API supports focused crawling through path filters and domain constraints, returning clean Markdown per page so content filtering can run directly on extracted text without needing to parse HTML.