On this page
- What MCP is, in one paragraph
- Why agents need a scraping verb, not an HTTP tool
- Installing Stekpad MCP in Claude Desktop
- Cursor
- Claude Code
- The eight tools the server ships
- Example 1 — an agent researching competitors
- Example 2 — an agent enriching a lead list
- Example 3 — an agent building a RAG corpus
- Budgeting credits from inside the model
- Errors the agent will actually hit
- Production tips
- Next steps
Most AI agents can read the web the way a goldfish reads a library. They get a static snapshot, they hallucinate a link, they call an HTTP tool that returns 4 MB of HTML and blows the context window. You do not want any of that in production.
What you want is a scraping layer your agent can call the same way it calls a calculator. A small, predictable verb surface. Typed errors. A credit budget the model can see. Results that land in a place the agent can come back to tomorrow without paying to fetch the same page twice.
That is what the Stekpad MCP server is. Eight tools. One API key. Reads are free. Writes charge credits. Every response fits in a model context.
This guide walks the whole path. What MCP is, why scraping agents need it, how to install the Stekpad server in Claude Desktop, Cursor, and Claude Code, what each of the eight tools does, three end-to-end agent recipes, how to budget credits from inside the model, how to handle the errors the agent will hit, and how to run any of this in production without waking up at 3 a.m.
What MCP is, in one paragraph
Model Context Protocol is a small JSON-RPC spec that lets a language model call external tools. A server declares tools with names, descriptions, and input schemas. A client (Claude Desktop, Cursor, Claude Code) discovers those tools, hands them to the model, and routes calls back and forth. That is the whole shape. No SDK lock-in, no custom function-calling format per vendor, no agent framework of the month. The Stekpad MCP server is a regular MCP server. If your client speaks MCP, it works.
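Concretely, the server advertises each verb in a `tools/list` result. A sketch of what the `scrape` declaration might look like on the wire (the description text and schema fields here are illustrative, not copied from the real server):

```json
{
  "tools": [
    {
      "name": "scrape",
      "description": "Fetch one URL and return markdown, JSON, HTML, or a screenshot.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "url": { "type": "string" },
          "format": { "type": "string", "enum": ["markdown", "json", "html", "screenshot"] }
        },
        "required": ["url"]
      }
    }
  ]
}
```

The client reads this once at startup, hands it to the model as a callable tool, and the model fills in the arguments.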
Why agents need a scraping verb, not an HTTP tool
Let your agent call raw fetch against the open web and you get three problems in a row.
First, context bloat. One modern marketing page is 300 kB of HTML and 80 kB of tracking JavaScript. Drop that into a 200k-token context and the model starts skipping instructions. You want markdown, not HTML.
Second, error fog. fetch returns a 200 on a Cloudflare interstitial, a 200 on a soft 404, and a 200 on a cookie wall. The model has no clean way to tell that it failed. You want typed errors.
Third, no memory. The agent paid to fetch stripe.com/pricing on Monday. On Tuesday it pays again for the same page because the HTTP tool does not know what a dataset is. You want persistent dataset storage, free to read.
A scraping verb solves all three at the wiring level. Stekpad ships five of them — scrape, crawl, map, extract, search — plus three free read verbs on top of the datasets where results land.
Installing Stekpad MCP in Claude Desktop
You need an API key first. Sign up at stekpad.com, open Settings → API keys, click Create key, copy the stkpd_live_... string. The free tier gives you 300 credits every month, which is enough to run every example in this guide twice.
Now open Claude Desktop. Settings → Developer → Edit Config. Paste this block, replacing the key.
```json
{
  "mcpServers": {
    "stekpad": {
      "command": "npx",
      "args": ["-y", "@stekpad/mcp"],
      "env": { "STEKPAD_API_KEY": "stkpd_live_..." }
    }
  }
}
```

Restart Claude Desktop. Open a new chat, click the tools icon, and you should see scrape, crawl, map, extract, search, list_datasets, get_dataset, and query_dataset listed under stekpad. If you do not, the most common cause is a stale npx cache — run npx clear-npx-cache and restart.
Cursor
Cursor reads MCP config from ~/.cursor/mcp.json. The block is identical.
```json
{
  "mcpServers": {
    "stekpad": {
      "command": "npx",
      "args": ["-y", "@stekpad/mcp"],
      "env": { "STEKPAD_API_KEY": "stkpd_live_..." }
    }
  }
}
```

Restart Cursor and open the composer; the Stekpad tools appear in the tool picker. The long version with screenshots lives at /integrations/cursor.
Claude Code
Claude Code installs MCP servers from the command line.
```bash
claude mcp add stekpad -- npx -y @stekpad/mcp
export STEKPAD_API_KEY=stkpd_live_...
```

Open Claude Code in a new terminal and run /mcp. You should see stekpad listed with eight tools. If you prefer the hosted HTTP MCP endpoint, connect to https://mcp.stekpad.com with Authorization: Bearer stkpd_live_....
The hosted MCP is the right call when you do not want `npx` running on the user machine at all — for example, when you ship a product that embeds Claude and you want to wire Stekpad in as a background tool without managing Node on the client.
The eight tools the server ships
Every verb from the REST API is an MCP tool from day one. Three read-only helpers on top give the agent a way to remember what it scraped.
- `scrape` — 1 credit, sync under 60 seconds. One URL in, markdown or JSON or HTML or a screenshot out. Maps to POST /v1/scrape. Use it when the agent needs one page.
- `crawl` — 1 credit per page, async, webhook or polling. Walks a whole site with include/exclude paths and canonical-URL dedupe. Returns a `run_id`. Maps to POST /v1/crawl.
- `map` — 1 credit per 1,000 URLs, sync. Lists URLs without fetching bodies. Use it before a targeted crawl so the agent can decide what is worth the credits.
- `extract` — 5 credits per URL, sync. Takes a JSON schema plus a prompt, returns validated fields with two automatic retries on schema failure. Powered by a Gemma-to-Haiku-to-Sonnet cascade through Vercel AI Gateway.
- `search` — 5 credits plus 1 per scraped result, sync. Runs a web search (Brave Search by default), optionally scrapes the top N results in the same call. The one-shot agent verb.
- `list_datasets` — 0 credits. Lists every dataset in your workspace.
- `get_dataset` — 0 credits. Returns schema and row counts for one dataset.
- `query_dataset` — 0 credits. Runs a filter + sort + limit query against a dataset and returns rows.
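The pricing above is simple enough to encode, which is handy when you want to estimate a job before letting the agent run it. A minimal sketch (the credit figures come from the tool list; the helper names and the ceil-rounding for `map` are my own assumptions):

```python
import math

# Credit costs per billed verb, as listed above.
def scrape_cost(pages: int = 1) -> int:
    return pages * 1            # 1 credit per page

def crawl_cost(pages: int) -> int:
    return pages * 1            # 1 credit per page

def map_cost(urls: int) -> int:
    return math.ceil(urls / 1000)  # 1 credit per 1,000 URLs (rounding assumed)

def extract_cost(urls: int) -> int:
    return urls * 5             # 5 credits per URL

def search_cost(scraped_results: int = 0) -> int:
    return 5 + scraped_results  # 5 credits plus 1 per scraped result

# A search that scrapes 5 results, then extract on those 5 rows:
total = search_cost(scraped_results=5) + extract_cost(5)
# total == 35
```

Running the arithmetic before the agent does is the cheapest guardrail you will ever write.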
The rule is simple: reads are free, writes are billed. We never want an agent to hit a paywall just to remember what it scraped yesterday.
Example 1 — an agent researching competitors
The brief: find the five biggest European open-source RAG frameworks, pull their pricing or sponsorship page, and store a one-line positioning summary for each.
In Claude Desktop, with the MCP server installed, you type this once.
```
Use Stekpad to search for "european open source RAG framework",
scrape the top 5 results, and save them to a new dataset called
"RAG competitors". For each row, add a column called `positioning`
that is a one-line summary of how the project describes itself.
```

What happens under the hood:
- Claude calls `search` with `query: "european open source RAG framework"`, `num_results: 5`, `scrape_results: true`. Cost: 5 + 5 = 10 credits.
- Stekpad runs the search through Brave Search, scrapes each result page server-side, and returns five rows with `url`, `title`, `markdown`, and `metadata.schema_org`.
- Claude calls `list_datasets`, confirms there is no `RAG competitors` dataset, then writes the five rows into a new one through the normal dataset-write path.
- Claude calls `extract` once per row with a JSON schema `{ positioning: string }` and a prompt. Cost: 5 × 5 = 25 credits.
- Claude calls `query_dataset` to read the final table back and prints it in the chat.
Total spend: 35 credits, about 1.75% of a 2,000-credit PAYG Starter pack. The agent never saw an HTML blob bigger than 20 kB. It never needed a browser. Every intermediate result is sitting in a Stekpad dataset the agent can query_dataset next week for zero credits.
Example 2 — an agent enriching a lead list
The brief: I pasted 200 company domains into a dataset. Add verified emails, a tech stack, and an industry classification.
```
Open the dataset called "inbound-signups-2026-04". For every row
with a `domain` column, run find_emails, email_verify, find_tech_stack,
and then ai_classify with labels ["B2B","B2C","SaaS","eCom","agency"].
Show me the top 20 verified emails at the end.
```

The agent runs four enrichers in sequence on 200 rows. Each enricher costs 1 credit per row. Total: 4 × 200 = 800 credits. You can read the full enricher catalog for what each one returns, but the shape is always the same: the agent calls a single enrich run, polls until `status: completed`, and then reads the result with `query_dataset` for free.
Three things worth noticing:
- No vendor keys. The agent does not need a Hunter key, an Apollo key, or a Clearbit key. Stekpad runs the 19 enrichers in-house, so your scraped data never leaves the stack.
- No duplicate billing. One credit per row per enricher. No per-vendor minimums, no per-vendor rate limits, no three separate invoices.
- Idempotent by default. Running the same enricher twice on the same row is a no-op unless the agent passes `force: true`. The model can retry as much as it wants without burning credits.
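The poll-until-completed loop is the same for every async run. A sketch, assuming a `get_run(run_id)` helper that wraps the run-status endpoint (the function name and the `status` field values are illustrative, not the real client library):

```python
import time

def wait_for_run(get_run, run_id: str, poll_seconds: float = 2.0,
                 timeout: float = 600.0) -> dict:
    """Poll an async run until it reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = get_run(run_id)
        if run["status"] in ("completed", "failed"):
            return run
        time.sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} still pending after {timeout}s")

# Usage with a stub that completes on the third poll:
statuses = iter(["queued", "running", "completed"])
fake_get_run = lambda rid: {"run_id": rid, "status": next(statuses)}
result = wait_for_run(fake_get_run, "run_123", poll_seconds=0)
# result["status"] == "completed"
```

Because enrichers are idempotent, a crashed loop can simply be restarted from the top without double-billing.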
Example 3 — an agent building a RAG corpus
The brief: build a searchable corpus from the full docs of `nextjs.org`, chunked and embedded, ready to query.
```
Map nextjs.org/docs with depth 4. Crawl the result, save markdown
into a dataset called "nextjs-docs". Then run ai_clean and ai_embed
on every row. When done, give me the dataset id.
```

The agent sequences four verbs:
- `map` on `https://nextjs.org/docs` with `depth: 4`. Returns ~800 URLs. Cost: 1 credit.
- `crawl` on those URLs with `formats: ["markdown"]`, `canonical_dedupe: true`, `exclude_paths: ["/api-reference/edge-functions"]`. Async. ~780 pages after dedupe. Cost: 780 credits.
- `ai_clean` on every row to strip nav, footer, and cookie-banner noise. Cost: 780 credits.
- `ai_embed` on the cleaned markdown column. Returns 1024-dim vectors from bge-m3 through Cloudflare Workers AI. Cost: 780 credits.
Total: 2,341 credits. That is one PAYG Starter pack for a searchable copy of the entire Next.js documentation, landed in a dataset the agent can query with query_dataset for free. Your RAG layer now needs two things you can write in an afternoon: a vector search call and a prompt template.
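The vector-search half can be a few lines over the rows `query_dataset` returns. A sketch, assuming each row carries its vector in an `embedding` field (the field name is an assumption, and toy 3-dim vectors stand in for the real 1024-dim bge-m3 output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(rows, query_vec, k=3):
    """Rank dataset rows by similarity to a query embedding."""
    scored = [(cosine(row["embedding"], query_vec), row) for row in rows]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [row for _, row in scored[:k]]

rows = [
    {"url": "/docs/routing", "embedding": [1.0, 0.0, 0.0]},
    {"url": "/docs/caching", "embedding": [0.0, 1.0, 0.0]},
]
best = top_k(rows, [0.9, 0.1, 0.0], k=1)
# best[0]["url"] == "/docs/routing"
```

At corpus scale you would push this into a proper vector index, but for a few thousand rows a linear scan over a free `query_dataset` read is often enough.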
Budgeting credits from inside the model
Every MCP tool response carries a credits_charged integer and the workspace's credits_remaining. The agent can read it and decide to stop. You should tell it to.
A system prompt that actually works in production looks like this.
```
You have a credit budget of 2000 Stekpad credits for this task.
Before every tool call, check credits_remaining in the last
Stekpad response. If credits_remaining would drop below 100,
stop and ask the human to approve more spend. Prefer `map` before
`crawl` when the target site is larger than 50 pages. Never call
`extract` on more than 20 URLs without confirmation.
```

A few sentences, and the agent now has an honest-to-goodness budget. The model does not need a new framework to respect it — it needs a number in the tool response and a rule in the system prompt.
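When the agent runs outside a chat client, the same rule can live in code as a wrapper around every billed call. A sketch (the `credits_remaining` field comes from the tool responses described above; the wrapper itself and its names are illustrative):

```python
class BudgetExceeded(Exception):
    pass

def guarded_call(tool_fn, state: dict, floor: int = 100, **params) -> dict:
    """Refuse a billed call once the last-seen balance hits the floor."""
    if state.get("credits_remaining", float("inf")) <= floor:
        raise BudgetExceeded("ask the human to approve more spend")
    response = tool_fn(**params)
    # Every Stekpad tool response carries credits_remaining; remember it.
    state["credits_remaining"] = response["credits_remaining"]
    return response

state = {}
fake_scrape = lambda **kw: {"markdown": "...", "credits_charged": 1,
                            "credits_remaining": 99}
guarded_call(fake_scrape, state, url="https://example.com")  # first call passes
# state["credits_remaining"] is now 99, so the next guarded_call raises
```

Belt and suspenders: the system prompt keeps the model frugal, and the wrapper makes overspend impossible even when the model ignores it.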
Errors the agent will actually hit
Real agents fail in boring, predictable ways. The Stekpad MCP server returns typed errors so the model can decide what to do without calling a human.
- `insufficient_credits` — the wallet is empty. `retry_after` is null. The agent should stop and print the top-up URL.
- `target_blocked` — Cloudflare, PerimeterX, or a rate limit. Comes with `retry_after` in seconds. The agent should wait, not retry instantly.
- `schema_validation_failed` — an `extract` call returned JSON that did not match the schema even after two retries. The agent should either relax the schema or drop the URL.
- `session_unavailable` — the cookie bridge for `stripe.com` is not connected. Comes with the exact next step in `message`. The agent should tell the human to open Chrome with the Stekpad extension active. The cookie bridge doc explains the full security envelope.
- `run_timeout` — an async crawl took longer than its TTL. The agent should check `get_run` for partial results, which are always kept.
- `bad_url` — the URL was malformed, an intranet host, or an unsupported scheme. No retry is useful. Drop and move on.
Every error has code, message, and optional retry_after. The model can branch on code. Never train it to branch on message — we reserve the right to rewrite those for clarity.
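Branching on `code` instead of `message` looks like this in practice. A sketch (the error codes are from the list above; the handler, its return values, and the injectable `sleep` are my own):

```python
import time

def handle_error(err: dict, sleep=time.sleep) -> str:
    """Return the next action for a typed Stekpad error."""
    code = err["code"]            # branch on code, never on message
    if code == "insufficient_credits":
        return "stop"             # print the top-up URL and halt
    if code == "target_blocked":
        sleep(err.get("retry_after") or 60)
        return "retry"            # wait out the block, then try again
    if code in ("schema_validation_failed", "bad_url"):
        return "drop"             # skip this URL and move on
    if code == "run_timeout":
        return "fetch_partial"    # get_run keeps partial results
    if code == "session_unavailable":
        return "ask_human"        # cookie bridge needs the extension
    return "ask_human"            # unknown codes go to a person

slept = []
action = handle_error({"code": "target_blocked", "retry_after": 30},
                      sleep=slept.append)
# action == "retry" and the handler waited the advertised 30 seconds
```

Six codes, six branches, no human in the loop until a branch genuinely needs one.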
Production tips
A few things we learned running this server in front of real agent traffic.
Use a separate API key per agent project. Keys are per-workspace, and rotating a key is a single click in the dashboard. If an agent starts burning credits on a prompt-injection, you want to kill one key, not your whole wallet.
Cap the context size of tool responses at 20 kB. Stekpad already returns markdown, not HTML, but a huge marketing page can still cross 40 kB. Pass max_chars: 20000 on scrape to truncate.
Prefer `map` plus `scrape` over `crawl` for small sites. crawl is optimized for 50+ pages. For five pages, you save nothing and you lose the ability for the model to decide between URLs.
Pin the MCP server version in CI. npx -y @stekpad/mcp@1.2.0 is a better choice than floating latest in a production deployment. We publish a changelog at /changelog.
Trust `query_dataset` as the agent's memory. Writes are expensive, reads are free. Structure your pipelines so the expensive calls happen once and the model reads the result as many times as it wants.
Never put customer PII in a dataset you share across agents. Stekpad datasets are workspace-scoped, but inside a workspace every API key can read every dataset. Use separate workspaces for customer-specific data.
Log tool calls on your side, not just ours. Every Stekpad response carries a run_id. Persist that run_id next to the agent turn that produced it. When something goes wrong two weeks later, you can replay the exact tool call and the exact response from the Stekpad run log without grepping Claude's chat history.
Pre-warm the cookie bridge for the domains the agent actually needs. If your agent is going to hit linkedin.com or stripe.com, walk the user through adding the domain in the Chrome extension before the first run. A session_unavailable error three tool calls deep is an ugly UX. A green bridge indicator before the agent even starts is a calm one. See the cookie bridge architecture doc for the full flow.
Test with a 20-credit budget first. Before you point the agent at a 10,000-page corpus, point it at a 10-page corpus. Watch the turns. Count the tool calls. Count the credits. Multiply. If the math does not fit your budget on a toy run, it will not fit on the real one.
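The "multiply" step is a ratio, assuming spend grows linearly with page count. A sketch (the sample numbers are made up for illustration):

```python
def projected_credits(toy_credits: int, toy_pages: int, real_pages: int) -> int:
    """Scale a toy run's spend to the real corpus, assuming linear cost."""
    return round(toy_credits * real_pages / toy_pages)

# A 10-page dry run that burned 42 credits projects to 42,000
# credits on the real 10,000-page corpus.
estimate = projected_credits(42, 10, 10_000)
```

If the projected number makes you wince, fix the pipeline at 10 pages, not at 10,000.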
Next steps
- Install the MCP server with the config above and run the competitor-research example. It costs 35 credits.
- Read the full MCP tool reference for the exact schemas of every tool.
- Read The 5 API verbs that replaced my scraping stack for the REST-side view of the same verbs.
- Read Why agents need live data for the case against fine-tuning your way out of this problem.
- If you are thinking about a production rollout, talk to us — we run office hours for teams deploying Stekpad behind a real agent.