Use case · RAG

Build a RAG corpus from any docs site.

One crawl. Clean markdown. Frontmatter on every file. Re-runnable next quarter.

The problem

Your RAG pipeline needs a clean corpus.

The docs you want are spread across a third-party docs site you don’t own. You don’t want to maintain a custom crawler, you don’t want to deal with their JavaScript, and you definitely don’t want to fight Markdown conversion at 2 a.m.

How Stekpad solves it

One `crawl` call with `dataset.type: markdown_bundle`.

Stekpad walks the site, renders every page, converts each to markdown with frontmatter, and lands them in a bundle you can export as a zip or a single concatenated file with a TOC.

  • Renders JavaScript (so docs sites with client-side routing work).
  • Strips boilerplate (nav, footer, side rail) automatically.
  • Keeps canonical URL, title, scraped_at, content_hash in the frontmatter.
  • Re-run the same `source_spec` next quarter to refresh.

Concrete example

Crawl 2,000 pages of docs.

POST /v1/crawl

```bash
curl -X POST https://api.stekpad.com/v1/crawl \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limits": { "max_pages": 2000, "max_depth": 6 },
    "include_paths": ["/docs/**", "/guides/**", "/reference/**"],
    "exclude_paths": ["/changelog/**"],
    "scrape_options": { "formats": ["markdown"] },
    "dataset": { "type": "markdown_bundle", "name": "Example docs corpus" },
    "webhook_url": "https://myapp.com/hooks/stekpad",
    "webhook_events": ["run.completed"]
  }'
```

When the run completes, export:

Export the bundle

```bash
curl "https://api.stekpad.com/v1/datasets/ds_xyz/export?format=zip" \
  -H "Authorization: Bearer stkpd_live_..." \
  -o example-docs.zip
```

Each file inside has frontmatter:

quickstart.md

```markdown
---
url: https://docs.example.com/guides/quickstart
title: Quickstart
scraped_at: 2026-04-14T12:34:56Z
content_hash: a1b2c3d4
---
# Quickstart
...
```

Pipe that into LangChain, LlamaIndex, or your own embedding script.
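The hand-off can be sketched in plain shell. A minimal sketch, assuming the bundle unzips to flat `.md` files like the one above; the JSONL field names (`url`, `hash`, `chars`) are placeholders for whatever your embedding script expects, not part of the Stekpad format:

```bash
# Demo fixture: one file laid out like the extracted bundle.
# In practice you'd run `unzip example-docs.zip -d corpus` instead.
mkdir -p corpus
cat > corpus/quickstart.md <<'EOF'
---
url: https://docs.example.com/guides/quickstart
title: Quickstart
scraped_at: 2026-04-14T12:34:56Z
content_hash: a1b2c3d4
---
# Quickstart
...
EOF

# One JSON line per file: url and content_hash pulled from the
# frontmatter, plus a character count of the body after the second ---.
for f in corpus/*.md; do
  url=$(awk -F': ' '/^url:/ {print $2; exit}' "$f")
  hash=$(awk -F': ' '/^content_hash:/ {print $2; exit}' "$f")
  body=$(awk 'c==2 {print} /^---$/ {c++}' "$f")
  printf '{"url":"%s","hash":"%s","chars":%d}\n' "$url" "$hash" "${#body}"
done > corpus.jsonl

cat corpus.jsonl
```

Each JSONL line then maps to one chunk-and-embed call in whatever framework you use.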

Cost

About 12 € for a full embedded corpus.

  • 2,000 pages crawled = 2,000 credits (~6 € on a Pro pack)
  • Optionally `ai_embed` on every file = +2,000 credits (~6 €)
  • Total to ship a full embedded RAG corpus from a 2,000-page docs site: about 12 €.
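The totals above as a back-of-envelope check (the rate of ~6 € per 2,000 credits is taken from the Pro-pack bullets; integer euros are close enough here):

```bash
pages=2000
crawl_credits=$pages                 # 1 credit per crawled page
embed_credits=$pages                 # optional ai_embed: 1 more per file
total_credits=$((crawl_credits + embed_credits))
eur=$((total_credits * 6 / 2000))    # Pro pack: ~6 EUR per 2,000 credits
echo "$total_credits credits, about $eur EUR"
```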

Refresh in three months

Re-run from the source_spec.

Re-run the dataset’s `source_spec` from the dashboard or via `POST /v1/datasets/ds_xyz/rerun`. Stekpad re-scrapes, bumps `_version` on changed files, and fires `row.changed` webhooks. Your downstream pipeline only re-embeds the diff.
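The "re-embed only the diff" step is a plain set difference. A minimal sketch, assuming you keep a `path content_hash` manifest per run, built from the bundle's frontmatter; the file names and hashes here are made up for illustration:

```bash
# Last quarter's manifest vs. this run's: one "path hash" per line.
cat > old.manifest <<'EOF'
quickstart.md a1b2c3d4
install.md 99887766
EOF
cat > new.manifest <<'EOF'
quickstart.md a1b2c3d4
install.md deadbeef
api.md 12345678
EOF

# comm -13 keeps lines present only in the new manifest,
# i.e. files that are new or whose content_hash changed.
sort old.manifest > old.sorted
sort new.manifest > new.sorted
comm -13 old.sorted new.sorted | cut -d' ' -f1 > to_reembed.txt

cat to_reembed.txt   # only these files go back through the embedder
```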

FAQ

Common questions.

Will the markdown be clean?

Yes. We strip nav/footer/side rail and convert with structural fidelity. Edge cases happen on heavily customized sites; you can post-process the bundle with `ai_clean`.

Can I crawl a docs site behind a login?

Yes — set `use_session: "<domain>"`. The cookie bridge handles the auth.
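For illustration, a request body for an authenticated crawl might look like the following. Only the `use_session: "<domain>"` setting itself comes from the answer above; its placement inside `scrape_options`, and the domain value, are assumptions:

```json
{
  "url": "https://docs.internal.example.com",
  "scrape_options": {
    "formats": ["markdown"],
    "use_session": "docs.internal.example.com"
  },
  "dataset": { "type": "markdown_bundle", "name": "Internal docs corpus" }
}
```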

How do I keep the corpus in sync?

Schedule the crawl on Cloud Starter+. The dataset re-runs nightly. Use `row.changed` webhooks to trigger re-embedding.

Crawl your first docs site.

300 free credits a month. No subscription. Re-runnable next quarter.
