GitHub - spider-rs/gottem: Router for scraping vendors under one API

Universal scraper that reliably gets the data.

What is gottem

gottem is one CLI and one Rust library that talks to every major scraping vendor — and your local browser — through a single tiered ladder. You give it a URL; it tries the lowest-cost way first, escalates when blocked, races vendors when speed matters, and stops when it gets clean content.

Each "way to fetch" is called a route. Routes are described in TOML and matched to one of a small set of adapters (plain HTTP, JSON API, streaming JSONL, headless Chrome over CDP, CAPTCHA solver). Adding a new vendor is a TOML row — no code change, no release.

Powered by spider.

Install

cargo install gottem-cli

This installs the gottem binary on your PATH from crates.io — no checkout needed. Building from source (e.g. for unreleased changes) is also fine:

git clone https://github.com/spider-rs/gottem.git
cd gottem
cargo install --path crates/gottem-cli

Try it in 30 seconds

# Inspect what's available — no API keys needed yet.
gottem routes list
gottem routes show spider.smart

# Tell gottem which vendor keys you have.
export FIRECRAWL_API_KEY=fc-...
export SPIDER_API_KEY=sk-...

# Fetch a URL. gottem starts at the lowest-cost tier, escalates if the basic routes fail.
gottem fetch https://example.com --show-meta

# Race three routes in parallel — fastest valid response wins.
gottem fetch https://example.com --mode race --routes firecrawl.scrape,spider.http,zenrows.basic

# Hedge: start at the lowest tier, fire a backup at the next tier after a delay.
gottem fetch https://example.com --mode hedge --hedge-delay-ms 2000

# Probe every tier on a target URL — useful for picking a baseline.
gottem probe https://hard-to-scrape.test

The tier ladder

Lower tier = lower-cost and faster. Higher tier = handles tougher anti-bot defenses. gottem walks the ladder lowest-cost-first by default and stops at the first route that returns valid content.

Tier	Typical cost	What's at this level
T0	free	direct local HTTP (you bring the URL, we send a GET)
T1–T3	varies	local HTTP through a proxy, or local headless Chrome
T4	$0.001	basic cloud HTTP (Firecrawl, Spider HTTP, ScrapingBee, ZenRows)
T5	$0.005	cloud HTTP with JS render
T6	$0.0075	cloud HTTP + residential proxy
T7	$0.008–0.010	smart unblockers — auto-fallback inside the vendor (Spider Smart, Zyte, Brightdata Unblocker)
T8	$0.010–0.015	browser-as-a-service over CDP (Brightdata Scraping Browser, Browserless, Spider Browser Cloud)
T9	$0.02+	last-resort: multi-step actors, premium scraping APIs, CAPTCHA solvers

You can pin the tier band you want with --tier-min / --tier-max, or hard-cap cost per fetch with --budget-mc.
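As a rough illustration of how a tier band and a per-fetch cost cap narrow the candidate set, here is a minimal std-only sketch. The `Route` struct, the `select_routes` function, the `local.http` and `browser.cdp` ids, and every cost except Firecrawl's `10` are hypothetical stand-ins, not gottem's actual catalog types or data.

```rust
/// Hypothetical route record; the real catalog type in gottem may differ.
#[derive(Debug, Clone, PartialEq)]
struct Route {
    id: &'static str,
    tier: u8,
    cost: u32, // cost units, assumed to share the scale of --budget-mc
}

/// Keep routes inside [tier_min, tier_max] whose cost fits the cap,
/// ordered lowest tier first, then lowest cost.
fn select_routes(routes: &[Route], tier_min: u8, tier_max: u8, budget: u32) -> Vec<Route> {
    let mut picked: Vec<Route> = routes
        .iter()
        .filter(|r| r.tier >= tier_min && r.tier <= tier_max && r.cost <= budget)
        .cloned()
        .collect();
    picked.sort_by_key(|r| (r.tier, r.cost));
    picked
}

fn main() {
    // Illustrative data only.
    let routes = [
        Route { id: "browser.cdp", tier: 8, cost: 120 },
        Route { id: "local.http", tier: 0, cost: 0 },
        Route { id: "spider.smart", tier: 7, cost: 80 },
        Route { id: "firecrawl.scrape", tier: 4, cost: 10 },
    ];
    // Equivalent in spirit to: --tier-min 4 --tier-max 7 --budget-mc 100
    for r in select_routes(&routes, 4, 7, 100) {
        println!("T{} {} ({} units)", r.tier, r.id, r.cost);
    }
}
```

Sorting by tier and then cost keeps the walk lowest-cost-first even when two routes share a tier.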

Built-in vendors

20 routes across 11 services. All you need is the env var.

Vendor	Routes (count)	Env var
Spider	4	`SPIDER_API_KEY`
Firecrawl	2	`FIRECRAWL_API_KEY`
ZenRows	3	`ZENROWS_API_KEY`
ScrapingBee	3	`SCRAPINGBEE_API_KEY`
Brightdata Web Unlocker	1	`BRIGHTDATA_TOKEN`
Zyte API	1	`ZYTE_API_KEY`
Brightdata Scraping Browser	1	`BRIGHTDATA_BROWSER`
Browserless	1	`BROWSERLESS_TOKEN`
Spider Browser Cloud	1	`SPIDER_API_KEY` (shared)
Apify	1	`APIFY_API_TOKEN`
Oxylabs Web Scraper	1	`OXYLABS_USER` + `OXYLABS_PASS`
2Captcha solver	1	`2CAPTCHA_API_KEY` (¹)

Don't see your vendor? Drop a TOML file in crates/gottem-routes-builtin/routes/ and you're done. See Routes are config, not code below.

¹ 2CAPTCHA_API_KEY starts with a digit, so POSIX shells (bash, zsh) refuse export 2CAPTCHA_API_KEY=.... Use a .env loader, prefix the binary with env 2CAPTCHA_API_KEY=..., or inject through your CI's secret store. Rust reads it via std::env::var regardless of how it got set.

The three modes

`--mode ladder` (default)

Try the lowest-cost route first. If the response fails validation (too short, WAF challenge, 5xx), escalate one tier and try again. Stop at the first valid response, the budget ceiling, or --max-retries.

Best for: most batch jobs. Cost-optimal.
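The escalation loop can be sketched in a few lines of std-only Rust. This is an illustration of the idea, not gottem's implementation: `Response`, `is_valid`, `ladder`, the 500-byte threshold, and the challenge marker string are all assumptions, and `max_attempts` is only a stand-in for the retry cap.

```rust
/// Hypothetical response shape, for illustration only.
struct Response {
    status: u16,
    body: String,
}

/// Mirrors the three failure reasons named above: 5xx, too short, WAF challenge.
/// The marker string is an illustrative assumption.
fn is_valid(resp: &Response, min_bytes: usize) -> bool {
    if resp.status >= 500 {
        return false;
    }
    if resp.body.len() < min_bytes {
        return false;
    }
    !resp.body.contains("Just a moment...")
}

/// Walk routes lowest tier first and return the id of the first route
/// whose response validates, trying at most `max_attempts` routes.
fn ladder<F>(routes: &[(&'static str, u8)], max_attempts: usize, mut fetch: F) -> Option<&'static str>
where
    F: FnMut(&str) -> Response,
{
    let mut ordered = routes.to_vec();
    ordered.sort_by_key(|&(_, tier)| tier);
    for (id, _tier) in ordered.into_iter().take(max_attempts) {
        let resp = fetch(id);
        if is_valid(&resp, 500) {
            return Some(id);
        }
    }
    None
}

fn main() {
    let routes = [("spider.smart", 7), ("local.http", 0), ("firecrawl.scrape", 4)];
    // Simulated fetches: T0 returns a 503, T4 returns a challenge page, T7 succeeds.
    let winner = ladder(&routes, 5, |id| match id {
        "local.http" => Response { status: 503, body: String::new() },
        "firecrawl.scrape" => Response { status: 200, body: "Just a moment...".repeat(40) },
        _ => Response { status: 200, body: "x".repeat(600) },
    });
    println!("{:?}", winner);
}
```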

`--mode race`

Fire all selected routes in parallel. First valid response wins; the rest are cancelled mid-flight.

Best for: latency-critical fetches when budget allows duplicate cost.
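The "first valid response wins" rule can be approximated with plain threads and a channel. This is a sketch under stated assumptions: `race` and `simulated` are hypothetical names, the fetches are simulated with sleeps, and because plain threads cannot be cancelled the losing jobs simply run to completion and are ignored, whereas the text above says gottem cancels them mid-flight.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run every job on its own thread and return the first result that passes `valid`.
/// Invalid results are skipped, so a fast failure does not win the race.
fn race<T, F, V>(jobs: Vec<(&'static str, F)>, valid: V) -> Option<(&'static str, T)>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
    V: Fn(&T) -> bool,
{
    let (tx, rx) = mpsc::channel();
    for (id, job) in jobs {
        let tx = tx.clone();
        thread::spawn(move || {
            // The receiver may already be gone if another job won; ignore that.
            let _ = tx.send((id, job()));
        });
    }
    drop(tx); // the channel closes once every worker has reported
    rx.into_iter().find(|(_, out)| valid(out))
}

/// Simulated fetch: sleeps, then returns a fixed body.
fn simulated(delay_ms: u64, body: &'static str) -> impl FnOnce() -> String + Send + 'static {
    move || {
        thread::sleep(Duration::from_millis(delay_ms));
        body.to_string()
    }
}

fn main() {
    let jobs = vec![
        ("slow", simulated(300, "slow body")),
        ("empty", simulated(0, "")),
        ("fast", simulated(30, "fast body")),
    ];
    let winner = race(jobs, |body: &String| !body.is_empty());
    println!("{:?}", winner);
}
```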

`--mode hedge`

Fire route 0 at t=0. If it hasn't returned by t = --hedge-delay-ms, fire route 1. Then fire route 2 at 2× that delay, and so on. The first valid response wins. The delay shrinks adaptively when latency variance is high, so slow tails get hedged more aggressively without any tuning.

Best for: high-throughput pipelines where most fetches are low-cost but the long tail kills you.
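The fixed part of the hedge timing is simple arithmetic: route i fires at i times the hedge delay. The sketch below computes that schedule and shows why hedging is cheap when the primary is fast. `hedge_schedule` and `routes_fired` are hypothetical helpers, and the adaptive shrinking of the delay is deliberately not modeled.

```rust
use std::time::Duration;

/// Launch offsets for a primary plus `hedge_count` backups: route i fires at i * delay.
fn hedge_schedule(hedge_delay: Duration, hedge_count: u32) -> Vec<Duration> {
    (0..=hedge_count).map(|i| hedge_delay * i).collect()
}

/// How many routes have been launched by the time the first valid response arrives.
/// Only these routes incur cost.
fn routes_fired(schedule: &[Duration], first_valid_at: Duration) -> usize {
    schedule.iter().filter(|&&t| t <= first_valid_at).count()
}

fn main() {
    // Equivalent in spirit to: --mode hedge --hedge-delay-ms 2000 --hedge-count 2
    let schedule = hedge_schedule(Duration::from_millis(2000), 2);
    println!("launch offsets: {:?}", schedule);
    for ms in [500u64, 2500, 4500] {
        let fired = routes_fired(&schedule, Duration::from_millis(ms));
        println!("valid response at {} ms: {} route(s) fired", ms, fired);
    }
}
```

A fetch that answers in 500 ms pays for one route only; the backups exist solely for the slow tail.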

Crawling

gottem also crawls, streaming pages as they arrive with no in-memory accumulation. From the CLI:

# Local BFS using your scrape ladder for each page.
gottem crawl https://example.com --depth 2 --limit 50

# Force the Spider /crawl endpoint (native JSONL streaming).
gottem crawl https://example.com --engine spider-cloud --depth 3 --limit 100

# Subdomains, allow/deny patterns, robots.txt.
gottem crawl https://example.com \
    --subdomains \
    --allow /blog --allow /docs \
    --deny /admin --deny '\.pdf$' \
    --respect-robots \
    --concurrency 8

# Dynamic params forwarded to the route body template — vendor-specific knobs
# without editing the TOML.
gottem crawl https://example.com --param waitFor=2000 --param mode=chrome

Output is NDJSON to stdout — one PageEntry per line, flushed immediately. Pipe to jq, tee, Postgres COPY, whatever. Memory stays constant regardless of crawl size.

Two engines, picked by --engine, plus an auto setting that chooses between them:

Engine	What it does
`spider-cloud`	POSTs to Spider's `/crawl` and streams the JSONL response back. Single round-trip per crawl, vendor handles fanout. Requires `SPIDER_API_KEY`.
`local`	BFS owned by gottem: every URL goes through the same scrape ladder you'd use for a one-off `fetch`, so per-page escalation (T0 → T7) still works mid-crawl. Link discovery uses `spider::page::Page::links` on bytes already returned — no re-fetch for outlink extraction. Visited / depth / allow / deny / robots / budget all delegated to `spider::website::Website`.
`auto`	Spider if the key is set, else local. (Default.)
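The `auto` rule in the table reduces to a small decision function. The following is a sketch only: the `Engine` enum and `resolve_engine` are hypothetical names, and treating an empty key as unset is an assumption rather than documented behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Engine {
    SpiderCloud,
    Local,
    Auto,
}

/// Resolve the requested engine to a concrete one.
/// `auto` picks Spider when a key is available and local otherwise;
/// asking for spider-cloud without a key is an error.
fn resolve_engine(requested: Engine, spider_key: Option<&str>) -> Result<Engine, String> {
    // Assumption: an empty key counts as missing.
    let has_key = spider_key.map_or(false, |k| !k.is_empty());
    match requested {
        Engine::Auto if has_key => Ok(Engine::SpiderCloud),
        Engine::Auto => Ok(Engine::Local),
        Engine::SpiderCloud if !has_key => {
            Err("spider-cloud requires SPIDER_API_KEY".to_string())
        }
        other => Ok(other),
    }
}

fn main() {
    let key = std::env::var("SPIDER_API_KEY").ok();
    println!("{:?}", resolve_engine(Engine::Auto, key.as_deref()));
}
```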

Library use — subscriber sugar over the raw Stream:

use std::sync::Arc;
use gottem_core::{CancelToken, ControlFlow, CrawlRequest, Orchestrator};
use url::Url;

let orch: Arc<Orchestrator> = /* built with crawl adapters installed */;
orch.crawl_builder(
        CrawlRequest::new(Url::parse("https://example.com")?)
            .with_limit(50)
            .with_depth(2),
    )
    .on_page(|page| async move {
        save_to_db(page).await;
        ControlFlow::Continue
    })
    .run(CancelToken::new())
    .await?;

Or the raw stream:

let mut stream = orch.crawl(req, CancelToken::new()).await?;
while let Some(page) = futures_util::StreamExt::next(&mut stream).await {
    /* ... */
}

The crawl never grows memory beyond what the consumer is holding — pages flow through, are yielded, and dropped. The local engine runs N concurrent workers (--concurrency, default 4) on the multi-threaded runtime.

CAPTCHA chains

gottem ships a 2Captcha adapter at T9 that you compose into your pipeline when a vendor returns a challenge page:

1. Run the primary fetch through the ladder.
2. Detect a CAPTCHA in the response (your code or a validator).
3. Call the captcha.2captcha route, passing siteKey + captchaType in req.extra.
4. Receive a solved token as content.
5. Replay the original URL with the token embedded (cookie / form field / header, depending on the captcha).

The solver handles 2Captcha's two-step submit-then-poll protocol internally — you just call it once. Supports reCAPTCHA v2, hCaptcha, and Cloudflare Turnstile.
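For readers unfamiliar with the submit-then-poll pattern, here is a generic std-only sketch of it. It does not talk to 2Captcha and does not reflect the adapter's code: the `Poll` enum, the `solve` function, and the closures standing in for the network calls are all hypothetical.

```rust
use std::thread;
use std::time::Duration;

/// Outcome of one poll. Names are illustrative, not 2Captcha's wire format.
enum Poll {
    NotReady,
    Ready(String),
    Failed(String),
}

/// Submit once, then poll until a token arrives, the task fails,
/// or `max_polls` attempts have been used.
fn solve<S, P>(submit: S, mut poll: P, interval: Duration, max_polls: u32) -> Result<String, String>
where
    S: FnOnce() -> Result<String, String>,
    P: FnMut(&str) -> Poll,
{
    let task_id = submit()?;
    for _ in 0..max_polls {
        match poll(&task_id) {
            Poll::Ready(token) => return Ok(token),
            Poll::Failed(reason) => return Err(reason),
            Poll::NotReady => thread::sleep(interval),
        }
    }
    Err(format!("task {} not solved after {} polls", task_id, max_polls))
}

fn main() {
    // Simulated solver: not ready twice, then returns a token.
    let mut polls = 0;
    let result = solve(
        || Ok("task-1".to_string()),
        |_task_id| {
            polls += 1;
            if polls < 3 {
                Poll::NotReady
            } else {
                Poll::Ready("solved-token".to_string())
            }
        },
        Duration::from_millis(10),
        5,
    );
    println!("{:?}", result);
}
```

Wrapping both steps behind one call is what lets the caller treat the solver like any other route.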

Routes are config, not code

Every vendor in gottem is one TOML row. Here's the entire Firecrawl route:

[[route]]
id          = "firecrawl.scrape"
adapter     = "http_json"
endpoint    = "https://api.firecrawl.dev/v1/scrape"
method      = "POST"
tier        = 4
cost        = 10
timeout_ms  = 30000

[route.auth]
kind = "bearer"
env  = "FIRECRAWL_API_KEY"

[route.body]
kind     = "json"
template = '''{"url":"{{url}}","formats":["markdown"]}'''

[route.parse]
kind = "json_path"
path = "$.data.markdown"

[[route.validate]]
kind = "min_bytes"
n    = 500

Adding ZenRows-style query-string auth is the same pattern with {{env:NAME}} in the endpoint URL. There are five adapters that cover essentially every scraping API in the wild:

direct_http — plain GET/POST
http_json — POST JSON, parse JSON (Firecrawl, Zyte, Brightdata, Apify, Oxylabs)
http_jsonl_stream — POST JSON, parse streaming JSONL (Spider)
chrome_cdp — WebSocket CDP (Brightdata Scraping Browser, Browserless)
captcha_2captcha — submit + poll (2Captcha)

You can also point gottem at your own --config routes.toml to layer custom routes on top of the built-ins.
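To make the `{{url}}` and `{{env:NAME}}` placeholders concrete, here is a minimal sketch of how such a template could be expanded. It is an assumption about the general technique, not gottem's renderer: `render` is a hypothetical function, the `api.example.test` endpoint and `EXAMPLE_API_KEY` are made up, and the sketch performs no JSON or URL escaping, which a real implementation would need.

```rust
/// Expand `{{url}}` and `{{env:NAME}}` placeholders in a template.
/// Unknown placeholders and unresolved variables are reported as errors.
fn render(
    template: &str,
    url: &str,
    env: impl Fn(&str) -> Option<String>,
) -> Result<String, String> {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find("}}").ok_or("unclosed placeholder")?;
        let key = after[..end].trim();
        if key == "url" {
            out.push_str(url);
        } else if let Some(name) = key.strip_prefix("env:") {
            let value = env(name).ok_or_else(|| format!("env var {} is not set", name))?;
            out.push_str(&value);
        } else {
            return Err(format!("unknown placeholder {}", key));
        }
        rest = &after[end + 2..];
    }
    out.push_str(rest);
    Ok(out)
}

fn main() {
    // Body template taken from the Firecrawl route above.
    let body = render(
        r#"{"url":"{{url}}","formats":["markdown"]}"#,
        "https://example.com",
        |name| std::env::var(name).ok(),
    );
    println!("{:?}", body);

    // Hypothetical query-string auth endpoint.
    let endpoint = render(
        "https://api.example.test/v1/?apikey={{env:EXAMPLE_API_KEY}}",
        "https://example.com",
        |_| Some("k123".to_string()),
    );
    println!("{:?}", endpoint);
}
```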

Modes recap

gottem fetch URL                                  # ladder, default
gottem fetch URL --mode race --routes a,b,c       # race A B C in parallel
gottem fetch URL --mode hedge --hedge-count 2     # primary + 2 staggered backups
gottem fetch URL --budget-mc 100                  # cap at $0.01 per fetch
gottem fetch URL --tier-min 4 --tier-max 7        # skip local; cap below T8
gottem fetch URL --require-js                     # only routes that render JS
gottem fetch URL --format json                    # structured output with metadata
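The budget numbers are easier to reason about with the unit written down. The scale below is inferred from this README's own examples (`--budget-mc 100` is described as $0.01, `Budget::new(1_000)` as $0.10, and the Firecrawl route's `cost = 10` sits in the $0.001 tier), so treat `UNITS_PER_DOLLAR` as an assumption; both helper functions are hypothetical.

```rust
/// Assumed scale, inferred from the README examples: one unit is $0.0001.
const UNITS_PER_DOLLAR: u32 = 10_000;

/// Convert budget or cost units to US dollars.
fn units_to_usd(units: u32) -> f64 {
    f64::from(units) / f64::from(UNITS_PER_DOLLAR)
}

/// How many fetches at `cost_per_fetch` fit under `budget`.
/// Returns None when the cost is zero (free routes are not budget-limited).
fn fetches_within(budget: u32, cost_per_fetch: u32) -> Option<u32> {
    budget.checked_div(cost_per_fetch)
}

fn main() {
    println!("--budget-mc 100 is ${:.4}", units_to_usd(100));
    println!("Budget::new(1_000) is ${:.4}", units_to_usd(1_000));
    println!("cost = 10 is ${:.4} per fetch", units_to_usd(10));
    println!("fetches under 1_000 units at cost 10: {:?}", fetches_within(1_000, 10));
}
```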

Using it as a library

use std::sync::Arc;
use gottem_core::{Budget, CancelToken, LadderStrategy, Orchestrator,
                  RouteCatalogBuilder, ScrapeRequest, Tier, AdapterRegistry, Capabilities};
use url::Url;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let catalog = Arc::new(
        gottem_routes_builtin::register_all(RouteCatalogBuilder::new())?.build()
    );

    let mut registry = AdapterRegistry::new();
    gottem_adapters_http::register_all(&mut registry, None);
    registry.register(gottem_adapters_spider::SpiderAdapter::arc());

    let orch = Arc::new(Orchestrator::new(
        catalog.clone(),
        Arc::new(registry),
        Arc::new(Budget::new(1_000)),  // $0.10 ceiling
    ));

    let strategy = Arc::new(LadderStrategy::new(
        catalog.clone(), Tier::T0, Tier::T9, Capabilities::default(), 5,
    ));

    let resp = orch.fetch(
        ScrapeRequest::get(Url::parse("https://example.com")?),
        strategy,
        CancelToken::new(),
    ).await?;

    println!("{}", resp.content.unwrap_or_default());
    Ok(())
}

Inspecting the catalog

gottem routes list           # tabular view of every loaded route
gottem routes show <id>      # full detail for one route
gottem routes validate       # check that every route's env var is set

routes validate exits 0 when every env var is present, exits 2 with a list otherwise — handy in CI.
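The check behind that exit code is essentially a presence test over a list of variable names. A minimal sketch of the same convention, usable in a custom CI step, might look like this; `missing_vars` and `exit_code` are hypothetical helpers and the list of names is only an example.

```rust
/// Return the names from `required` that `lookup` cannot resolve.
fn missing_vars<'a>(
    required: &[&'a str],
    lookup: impl Fn(&str) -> Option<String>,
) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|&name| lookup(name).is_none())
        .collect()
}

/// Exit-code convention described above: 0 when nothing is missing, 2 otherwise.
fn exit_code(missing: &[&str]) -> i32 {
    if missing.is_empty() {
        0
    } else {
        2
    }
}

fn main() {
    let required = ["SPIDER_API_KEY", "FIRECRAWL_API_KEY", "ZENROWS_API_KEY"];
    let missing = missing_vars(&required, |name| std::env::var(name).ok());
    for name in &missing {
        eprintln!("missing: {}", name);
    }
    // A real tool would pass this value to std::process::exit.
    println!("would exit with {}", exit_code(&missing));
}
```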

Hosted gottem — gottem.dev

Don't want to manage vendor keys or run a browser? The same engine runs as a managed API at gottem.dev. Sign up, create a gtm_ key, and call it: no vendor keys to wrangle, pay-as-you-go credits.

From the CLI — `--remote`

The gottem CLI can run any fetch against the hosted API instead of the local ladder. Set your key once and add --remote:

export GOTTEM_API_KEY=gtm_your_key_here

# Runs on api.gottem.dev — no local vendor keys, no browser.
gottem fetch --remote https://example.com

# All the usual flags carry over to the hosted run.
gottem fetch --remote --mode race --show-meta https://example.com
gottem fetch --remote --format json https://example.com

# Or pass the key explicitly instead of the env var.
gottem fetch --remote --api-key gtm_your_key_here https://example.com

$GOTTEM_API_URL overrides the base URL (defaults to https://api.gottem.dev).

From any HTTP client

# Hosted equivalent of `gottem fetch <url>`.
curl -X POST https://api.gottem.dev/scrape \
  -H "Authorization: Bearer gtm_your_key_here" \
  -H "content-type: application/json" \
  -d '{"url": "https://example.com"}'

# /v1/compare — runs every provider and compares quality, cost, and content
# side by side. Great for picking a route.
curl -X POST https://api.gottem.dev/v1/compare \
  -H "Authorization: Bearer gtm_your_key_here" \
  -H "content-type: application/json" \
  -d '{"url": "https://example.com"}'

The CLI and the hosted API share the same route catalog and escalation logic — prototype locally with this crate, run production traffic on the hosted API. Full reference: gottem.dev/docs.

What's inside

gottem/
├── assets/                          logo, dark-mode logo, icon
└── crates/
    ├── gottem-core                  traits, types, orchestrator, retry strategies
    ├── gottem-adapters-http         direct_http · http_json · http_jsonl_stream
    ├── gottem-adapters-spider       T0–T3 local fetching via spider::Website
    ├── gottem-adapters-chrome       T8 CDP via `chromey` (chromiumoxide fork)
    ├── gottem-adapters-captcha      T9 2Captcha solver chain primitive
    ├── gottem-routes-builtin        embedded vendor TOML, feature-gated per vendor
    └── gottem-cli                   `gottem` binary: fetch · crawl · probe · routes

Every adapter and every vendor is behind a Cargo feature, so you can build a CLI with only the routes you actually need.

License

Apache-2.0 OR MIT, your choice.
