Use case · RAG

Build a RAG corpus from any docs site.

One crawl. Clean markdown. Frontmatter on every file. Re-runnable next quarter.

The problem

Your RAG pipeline needs a clean corpus.

The docs you want are spread across a third-party docs site you don’t own. You don’t want to maintain a custom crawler, you don’t want to deal with their JavaScript, and you definitely don’t want to fight Markdown conversion at 2 a.m.

How Stekpad solves it

One `crawl` call with `dataset.type: markdown_bundle`.

Stekpad walks the site, renders every page, converts each to markdown with frontmatter, and lands them in a bundle you can export as a zip or a single concatenated file with a TOC.

  • Renders JavaScript (so docs sites with client-side routing work).
  • Strips boilerplate (nav, footer, side rail) automatically.
  • Keeps canonical URL, title, scraped_at, content_hash in the frontmatter.
  • Re-run the same `source_spec` next quarter to refresh.

Concrete example

Crawl 2,000 pages of docs.

POST /v1/crawl

```bash
curl -X POST https://api.stekpad.com/v1/crawl \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limits": { "max_pages": 2000, "max_depth": 6 },
    "include_paths": ["/docs/**", "/guides/**", "/reference/**"],
    "exclude_paths": ["/changelog/**"],
    "scrape_options": { "formats": ["markdown"] },
    "dataset": { "type": "markdown_bundle", "name": "Example docs corpus" },
    "webhook_url": "https://myapp.com/hooks/stekpad",
    "webhook_events": ["run.completed"]
  }'
```

When the run completes, export:

Export the bundle

```bash
curl "https://api.stekpad.com/v1/datasets/ds_xyz/export?format=zip" \
  -H "Authorization: Bearer stkpd_live_..." \
  -o example-docs.zip
```

Each file inside has frontmatter:

quickstart.md

```markdown
---
url: https://docs.example.com/guides/quickstart
title: Quickstart
scraped_at: 2026-04-14T12:34:56Z
content_hash: a1b2c3d4
---
# Quickstart
...
```

Pipe that into LangChain, LlamaIndex, or your own embedding script.
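The hand-off can be sketched in plain shell. A minimal sketch, assuming the bundle unzips to flat `.md` files like the one above; the JSONL field names (`url`, `hash`, `chars`) are placeholders for whatever your embedding script expects, not part of the Stekpad format:

```bash
# Demo fixture: one file laid out like the extracted bundle.
# In practice you'd run `unzip example-docs.zip -d corpus` instead.
mkdir -p corpus
cat > corpus/quickstart.md <<'EOF'
---
url: https://docs.example.com/guides/quickstart
title: Quickstart
scraped_at: 2026-04-14T12:34:56Z
content_hash: a1b2c3d4
---
# Quickstart
...
EOF

# One JSON line per file: url and content_hash pulled from the
# frontmatter, plus a character count of the body after the second ---.
for f in corpus/*.md; do
  url=$(awk -F': ' '/^url:/ {print $2; exit}' "$f")
  hash=$(awk -F': ' '/^content_hash:/ {print $2; exit}' "$f")
  body=$(awk 'c==2 {print} /^---$/ {c++}' "$f")
  printf '{"url":"%s","hash":"%s","chars":%d}\n' "$url" "$hash" "${#body}"
done > corpus.jsonl

cat corpus.jsonl
```

Each JSONL line then maps to one chunk-and-embed call in whatever framework you use.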

Cost

About 12 € for a full embedded corpus.

  • 2,000 pages crawled = 2,000 credits (~6 € on a Pro pack)
  • Optionally `ai_embed` on every file = +2,000 credits (~6 €)
  • Total to ship a full embedded RAG corpus from a 2,000-page docs site: about 12 €.
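The totals above as a back-of-envelope check (the rate of ~6 € per 2,000 credits is taken from the Pro-pack bullets; integer euros are close enough here):

```bash
pages=2000
crawl_credits=$pages                 # 1 credit per crawled page
embed_credits=$pages                 # optional ai_embed: 1 more per file
total_credits=$((crawl_credits + embed_credits))
eur=$((total_credits * 6 / 2000))    # Pro pack: ~6 EUR per 2,000 credits
echo "$total_credits credits, about $eur EUR"
```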

Refresh in three months

Re-run from the source_spec.

Re-run the dataset’s `source_spec` from the dashboard or via `POST /v1/datasets/ds_xyz/rerun`. Stekpad re-scrapes, bumps `_version` on changed files, and fires `row.changed` webhooks. Your downstream pipeline only re-embeds the diff.
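The "re-embed only the diff" step is a plain set difference. A minimal sketch, assuming you keep a `path content_hash` manifest per run, built from the bundle's frontmatter; the file names and hashes here are made up for illustration:

```bash
# Last quarter's manifest vs. this run's: one "path hash" per line.
cat > old.manifest <<'EOF'
quickstart.md a1b2c3d4
install.md 99887766
EOF
cat > new.manifest <<'EOF'
quickstart.md a1b2c3d4
install.md deadbeef
api.md 12345678
EOF

# comm -13 keeps lines present only in the new manifest,
# i.e. files that are new or whose content_hash changed.
sort old.manifest > old.sorted
sort new.manifest > new.sorted
comm -13 old.sorted new.sorted | cut -d' ' -f1 > to_reembed.txt

cat to_reembed.txt   # only these files go back through the embedder
```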

FAQ

Common questions.

Will the markdown be clean?

Yes. We strip nav/footer/side rail and convert with structural fidelity. Edge cases happen on heavily customized sites; you can post-process the bundle with `ai_clean`.

Can I crawl a docs site behind a login?

Yes — set `use_session: "<domain>"`. The cookie bridge handles the auth.
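For illustration, a request body for an authenticated crawl might look like the following. Only the `use_session: "<domain>"` setting itself comes from the answer above; its placement inside `scrape_options`, and the domain value, are assumptions:

```json
{
  "url": "https://docs.internal.example.com",
  "scrape_options": {
    "formats": ["markdown"],
    "use_session": "docs.internal.example.com"
  },
  "dataset": { "type": "markdown_bundle", "name": "Internal docs corpus" }
}
```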

How do I keep the corpus in sync?

Schedule the crawl on Cloud Starter+. The dataset re-runs nightly. Use `row.changed` webhooks to trigger re-embedding.

Crawl your first docs site.

300 free credits a month. No subscription. Re-runnable next quarter.
