Deep dive

The 5 API verbs that replaced my scraping stack

Scrape, crawl, map, extract, search — five verbs that replaced a dozen tools in our own pipeline. What each one does and when to call it.

Stekpad Team · 9 min read

I used to run a scraping stack that was eleven moving parts on a good day and sixteen on a bad one. Python. BeautifulSoup for the easy pages. Playwright for the rendered ones. A Celery worker pool because nothing is sync at scale. Redis as the job queue and the rate-limit cache. Postgres for the results. S3 for the HTML blobs I was too paranoid to throw away. A custom FastAPI wrapper so the rest of the company could hit it without learning Celery. A cron job for the nightly re-crawls. A separate Hunter script for email finding. A Python worker that called Clearbit for company info. A Google Cloud Run deployment budget I was embarrassed to show to finance.

Fifty thousand URLs a month. One-person maintenance burden. Two pages of runbook. One outage every three weeks when Cloudflare rotated a fingerprint and my Playwright config stopped working at 4 a.m.

Then I deleted 80 % of it and replaced it with five verbs. scrape. crawl. map. extract. search. Five API calls. One credit wallet. Every result lands in a dataset I can re-query. This is the story of what came out, what replaced it, and the three things I still run myself because I should.

The stack that was

For posterity, here is what the old diagram looked like.

text
[cron] --> [FastAPI wrapper] --> [Celery]
                                     |
                 +-------------------+-------------------+
                 |                   |                   |
           [BS4 worker]     [Playwright worker]      [Hunter]
                 |                   |                   |
                 +---------+---------+                   |
                           |                             |
                      [Postgres] <-- [Clearbit] ---------+
                           |
                      [S3 blobs]

Each box was a thing I owned. Each arrow was a thing I debugged. The boxes I liked (Postgres, S3, the FastAPI wrapper) are still in the new diagram. The boxes I did not like (the Celery pool, the Playwright config, the rate-limit logic, the Hunter script, the Clearbit script, the proxy rotation, the robots.txt parser I half-wrote in a weekend) are gone.

Verb 1 — scrape

What it replaced: the BS4 worker, the Playwright worker, the proxy rotation, the cookie handling, half the runbook.

`scrape` is the "fetch one URL, return content" verb. It takes a URL, a format list (markdown, json, html, screenshot), and a handful of optional fields (actions, timeout, use_session). It returns a run_id, the content in the formats you asked for, a metadata object with schema.org and Open Graph, and a dataset_row_id for the persistent row.

bash
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com/item?id=40000000",
    "formats": ["markdown"],
    "dataset": "hn-mentions"
  }'

In the old stack, this was 40 lines of Python, a Celery task, a decision tree between BS4 and Playwright, a proxy retry loop, and a Postgres insert. Now it is one call. What got deleted: three files, a worker type, a proxy credential rotation, and the "is this page rendered client-side?" branch.

What I gained: a response that already contains markdown (not HTML I need to parse), schema.org data I was not bothering to extract, and a row in the hn-mentions dataset I can query_dataset later for free. One credit per scrape. 1,000 scrapes for the price of two espressos.
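
The same call from Python, without an SDK, is just a payload and two headers. A minimal standard-library sketch; the helper name and defaults are mine, and the payload fields mirror the curl example above:

```python
import json
import urllib.request

def build_scrape_request(url, formats=("markdown",), dataset=None,
                         api_key="stkpd_live_...",
                         base="https://api.stekpad.com"):
    """Build the POST /v1/scrape request shown in the curl example.

    Returns a urllib Request; nothing is sent until you call
    urllib.request.urlopen(req) yourself. Payload fields (url,
    formats, dataset) are taken straight from the example above.
    """
    payload = {"url": url, "formats": list(formats)}
    if dataset:
        payload["dataset"] = dataset
    return urllib.request.Request(
        f"{base}/v1/scrape",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_scrape_request(
    "https://news.ycombinator.com/item?id=40000000",
    dataset="hn-mentions",
)
```

Calling `urllib.request.urlopen(req)` performs the actual scrape; I stop at building the request here.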

Verb 2 — crawl

What it replaced: the Celery fan-out, the URL frontier, the canonical-URL dedupe, the nightly recrawl cron.

`crawl` walks a site. You pass a starting URL, include and exclude path rules, a depth, an optional sitemap hint, and a format list. You get back a run_id. The run is async — you poll GET /v1/runs/:id or subscribe to a webhook. Every page lands in the same dataset as a row.

bash
curl -X POST https://api.stekpad.com/v1/crawl \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com/api",
    "include_paths": ["/api/**"],
    "exclude_paths": ["/api/deprecated/**"],
    "canonical_dedupe": true,
    "formats": ["markdown"],
    "dataset": "stripe-api-docs",
    "webhook": "https://my-app.com/hooks/stekpad"
  }'

My old stack's crawl logic was 200 lines of Python plus a Redis priority queue plus a robots.txt parser plus a URL normalizer that I was mostly sure was correct. It worked until it did not. I gave up on content-hash dedupe entirely because I never got the normalization right.

The new stack gives me canonical-URL dedupe by default, honors robots, supports include/exclude paths as glob patterns, and webhooks me when the run is done. Eleven webhook events in total — run.queued, run.running, run.progress, run.completed, run.failed, run.cancelled, row.added, row.changed, enrichment.completed, credits.low, session.unavailable. My nightly recrawl cron became a single POST /v1/crawl scheduled on the Cloud Starter plan.
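
When I do not want a webhook, polling GET /v1/runs/:id until a terminal state is a loop worth writing once. A sketch of the pattern, assuming the run object carries a status field that mirrors the run.* webhook event names; the fetch function is injected so the loop stays testable:

```python
import time

# Terminal states, assumed from the run.* webhook event names above.
TERMINAL = {"completed", "failed", "cancelled"}

def wait_for_run(fetch_run, run_id, interval=2.0, timeout=300.0):
    """Poll a run until it reaches a terminal status.

    fetch_run(run_id) should GET /v1/runs/:id and return the run as
    a dict. Raises TimeoutError if the run is still going when the
    deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        run = fetch_run(run_id)
        if run.get("status") in TERMINAL:
            return run
        if time.monotonic() > deadline:
            raise TimeoutError(f"run {run_id} still {run.get('status')!r}")
        time.sleep(interval)
```

In production I pass a small wrapper around my HTTP client as `fetch_run` and a longer timeout for big crawls.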

Verb 3 — map

What it replaced: the URL discovery script, the sitemap fetcher, the "how big is this site?" spreadsheet.

`map` is the verb I did not know I wanted until I had it. It lists URLs for a site without fetching the bodies. Sitemaps, robots.txt, a shallow link walk — all combined into a single list. Fast, cheap, and perfect as a dry-run before a targeted crawl.

bash
curl -X POST https://api.stekpad.com/v1/map \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://docs.stripe.com", "max_urls": 5000 }'

Cost: 1 credit per 1,000 URLs returned. For a site with 4,200 pages, that is 5 credits — less than half a cent on the PAYG Pro pack. In the old stack, the only way I had to answer "how many URLs does docs.stripe.com have?" was to fire up a crawl and watch it. That cost dozens of euros in compute minutes and fetched the bodies of pages I had not yet decided I wanted.
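
The billing math is simple enough to encode. A one-liner that assumes round-up billing, which reproduces the 4,200 pages to 5 credits figure above:

```python
import math

def map_credits(urls_returned: int) -> int:
    """Credits for a map call at 1 credit per 1,000 URLs returned.

    Round-up billing is my assumption, but it matches the article's
    4,200 pages -> 5 credits example.
    """
    return max(1, math.ceil(urls_returned / 1000))
```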

Now my workflow is always the same: map first, eyeball the URL list, decide which subset I want, then crawl exactly that subset. Two verbs, zero waste.
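
The "decide which subset" step is a filter over the mapped list. A client-side sketch using fnmatch, whose `*` crosses `/` boundaries and therefore only approximates the server's `**` glob semantics:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def select_urls(mapped_urls, include=("*",), exclude=()):
    """Preview which mapped URLs a crawl's include/exclude globs
    would keep.

    Patterns are matched against the URL path only. fnmatch's '*'
    matches across '/' separators, so this approximates '**'; the
    server's exact glob rules may differ.
    """
    keep = []
    for url in mapped_urls:
        path = urlparse(url).path
        if any(fnmatch(path, pat) for pat in include) and \
           not any(fnmatch(path, pat) for pat in exclude):
            keep.append(url)
    return keep
```

Running this over the map output with the same patterns I plan to pass to crawl tells me roughly how many credits the crawl will cost before I spend any.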

Verb 4 — extract

What it replaced: the BS4 selector files, the schema guessing, the "re-parse every page when the layout changes" Sunday afternoons.

`extract` takes a URL, a JSON schema, and an optional natural-language prompt. It returns validated, typed fields. Behind the scenes it runs an LLM cascade — Gemma 4 first, Haiku if Gemma fails the schema, Sonnet if Haiku fails, with two automatic retries on schema errors. Powered by Cloudflare Workers AI through Vercel AI Gateway.

bash
curl -X POST https://api.stekpad.com/v1/extract \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/widget-42",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price_eur": { "type": "number" },
        "in_stock": { "type": "boolean" },
        "variants": { "type": "array", "items": { "type": "string" } }
      },
      "required": ["name", "price_eur", "in_stock"]
    },
    "prompt": "Extract the main product, not related items."
  }'

5 credits per URL. Two retries included. No CSS selectors. No XPath. No breaking every time marketing redesigns the product page. I used to have a directory called selectors/ with one file per target site, and every week someone's shop swapped their .price class for a data-price attribute and I spent Sunday afternoon fixing it. That directory is gone. The model reads the page and fills the schema.

The trick that makes extract actually reliable is the LLM cascade plus schema validation. Gemma 4 handles ~85 % of pages. Haiku handles another ~12 %. Sonnet handles the last ~3 %. I pay the cheap-model price on most rows and the expensive-model price only when I have to. I never see the cascade in the API — I just see my JSON.
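
The cascade-plus-validation pattern is easy to sketch client-side. Stub callables stand in for the models; the escalation and retry logic follows the description above, though the real implementation is server-side and opaque to me:

```python
def run_cascade(page_text, schema_ok, models, retries=2):
    """Try each model in order; escalate when output fails validation.

    models: list of (name, callable) pairs, cheapest first - standing
    in for the Gemma -> Haiku -> Sonnet cascade described above.
    schema_ok: callable returning True when an extracted dict
    validates. Each model gets 1 + retries attempts before the next,
    more expensive model is tried.
    """
    for name, model in models:
        for _ in range(1 + retries):
            result = model(page_text)
            if schema_ok(result):
                return name, result
    raise ValueError("all models failed schema validation")
```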

Verb 5 — search

What it replaced: the Google Custom Search API, the Brave Search API I glued myself, the Hunter domain search.

`search` runs a web search and optionally scrapes every result in the same call. Under the hood it uses Brave Search by default and drops the top N URLs into a mini scrape fan-out. One verb, one API call, one dataset write.

bash
curl -X POST https://api.stekpad.com/v1/search \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "open source RAG frameworks site:github.com",
    "num_results": 10,
    "scrape_results": true,
    "formats": ["markdown"],
    "dataset": "rag-research"
  }'

5 credits plus 1 per scraped result. 10 results = 15 credits. Results land in the rag-research dataset with url, title, markdown, metadata. The verb I use most often in prototyping — it is the single most powerful "give my agent a starting point" call in the whole stack.

The old stack, now

For the curious, here is what is left of the original diagram.

text
[cron] --> [FastAPI wrapper] --> [Stekpad API]
                                       |
                                  [datasets]
                                       |
                                  [enrichers]
                                       |
                        [my Postgres for business logic]

Three boxes survived. The cron. The FastAPI wrapper. My Postgres. Everything else is gone — the Celery pool, the Redis queue, the Playwright workers, the BeautifulSoup parsers, the proxy rotation, the robots.txt parser, the URL normalizer, the S3 blob store, the Hunter script, the Clearbit script, the selector files. Eleven fewer things to maintain. Eleven fewer things that can page me at 4 a.m.

My Google Cloud Run bill dropped from ~420 €/month to ~40 €/month. My Stekpad credit spend is ~80 €/month at the same 50k URL volume, on the PAYG Pro pack. Total: ~120 €/month versus ~420 €. Three and a half times cheaper, plus the labor I am not paying myself on Sundays.

What I still maintain myself

This section matters. Stekpad is not a magic "delete your whole backend" button, and any post that pretends otherwise is lying to you.

My business logic. The rules that turn a scraped row into a signal that matters to my product — "is this a direct competitor?", "is this a lead we can talk to?", "does this price change trigger an alert?" — are still mine. Stekpad gives me the raw material, I decide what is meaningful. This is where I spend most of my engineering time now, which is the correct place to spend it.

Row cleanup and deduplication against my existing records. Stekpad dedupes scraped rows against each other by canonical URL. It does not dedupe against my own customer database or my existing CRM. A nightly job still reads new rows, normalizes fields, and merges them into my Postgres.
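
That nightly merge boils down to a canonical-URL key and an upsert. A sketch with my own normalization rules (lowercase host, drop tracking params, strip trailing slashes) and an in-memory dict standing in for Postgres:

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Query params I strip before deduping; my list, not Stekpad's.
TRACKING = {"utm_source", "utm_medium", "utm_campaign",
            "utm_term", "utm_content"}

def canonical_key(url: str) -> str:
    """Normalize a URL into a dedupe key."""
    p = urlparse(url)
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(p.query) if k not in TRACKING))
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme.lower(), p.netloc.lower(),
                       path, "", query, ""))

def merge_rows(existing: dict, new_rows):
    """Upsert scraped rows into a local store keyed by canonical
    URL; newer rows overwrite older ones, which is what the nightly
    job does against Postgres."""
    for row in new_rows:
        existing[canonical_key(row["url"])] = row
    return existing
```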

Custom enrichers I need that Stekpad does not ship. I have one custom enricher that pings an internal API to mark which domains belong to existing customers. Stekpad's 19 enrichers cover the public-data side. My one custom step runs in my own FastAPI wrapper, reads the dataset via GET /v1/datasets/:id/rows, writes the result back. Total code: ~60 lines.
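
Compressed to its skeleton, that enricher is one loop. The read, lookup, and write-back functions are injected here; in production they wrap GET /v1/datasets/:id/rows, the internal customer API, and a write-back call whose exact endpoint I am glossing over:

```python
from urllib.parse import urlparse

def enrich_customer_flag(read_rows, is_customer, write_back):
    """Mark dataset rows whose domain belongs to an existing customer.

    read_rows()    -> iterable of row dicts (wraps the dataset read)
    is_customer(d) -> bool, the internal API lookup, stubbed in tests
    write_back(r)  -> persists the updated row

    Returns how many rows were updated; unchanged rows are skipped.
    """
    updated = 0
    for row in read_rows():
        domain = urlparse(row["url"]).netloc
        flag = is_customer(domain)
        if row.get("is_customer") != flag:
            row["is_customer"] = flag
            write_back(row)
            updated += 1
    return updated
```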

The FastAPI wrapper itself. I kept it because it is where I enforce my own auth, rate limits for my own users, and the business-logic layer above. Stekpad is the data source, not the user-facing API. That separation has not changed.

Monitoring and alerting. Credit thresholds, failed-run alerts, dataset size checks. Stekpad emits webhook events for all of these (credits.low, run.failed, enrichment.completed), but I still wire them into my own Grafana and Slack because that is where my on-call already looks.
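
The wiring into Slack and Grafana starts with a dispatch table keyed on event name. A sketch that assumes the webhook payload carries the event name in an "event" field, which is my guess at the shape:

```python
def make_dispatcher(handlers, default=None):
    """Route webhook payloads by their event name.

    handlers maps event names from the article's list (e.g.
    "credits.low", "run.failed") to callables; unmatched events fall
    through to `default` when one is given, else are ignored.
    """
    def dispatch(payload):
        handler = handlers.get(payload.get("event"), default)
        if handler is None:
            return None
        return handler(payload)
    return dispatch
```

In my wrapper the handlers are tiny functions that post to Slack or push a Grafana annotation; the table is the whole router.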

A couple of things that surprised me

Dataset storage is free to read. I kept expecting to be charged for query_dataset calls. I was not. Agents and scripts can re-read data as many times as they want. That changed how I think about caching.

Failed runs auto-refund. Credits for failed runs are refunded back to the workspace wallet within a few seconds, visible in the usage log. I stopped pre-building retry budgets into my cron jobs.

The MCP server was free ROI. I installed the Stekpad MCP in Claude Desktop because I was curious, and now half of my ad-hoc "find me the latest on X" queries happen in Claude instead of a terminal. Same credits, zero extra code. The full agent-building guide walks the install.

The cookie bridge broke the "headless server" habit. For the one authenticated site I still scrape (an internal admin panel), I install the Stekpad Chrome extension on my laptop and authorize the domain there. The scraping runs inside my own browser. No session cookies ever hit a server I do not own. It is slower than a headless fetch by about a second, and I do not care.

The scraping layer became a 3-line call

The part of my code that does "fetch this page, give me structured fields" used to be 80 lines across three files. Now it is this:

python
from stekpad import Stekpad
sp = Stekpad(api_key="stkpd_live_...")
row = sp.extract(url=url, schema=MY_SCHEMA).fields

Three lines. Typed. Retried. Stored. I still own the schema (MY_SCHEMA), because that is my product. I no longer own the scraping machinery, because the scraping machinery is not my product.

If your scraping stack is eating a day a week, start with one verb. scrape is the easiest. Paste a URL, see markdown come back, put the call next to the code that was doing it the hard way. If the shape fits, the other four verbs fit the same way.

Next steps

Stekpad Team
We build Stekpad. We scrape the web, store it, and enrich it — from an API, from an app, or from Claude.

Try the API. Free to start.

3 free runs a day on the playground. No credit card. Install MCP for Claude in 60 seconds.
