Build a RAG corpus from any docs site.
One crawl. Clean markdown. Frontmatter on every file. Re-runnable next quarter.
Your RAG pipeline needs a clean corpus.
The docs you want are spread across a third-party docs site you don’t own. You don’t want to maintain a custom crawler, you don’t want to deal with their JavaScript, and you definitely don’t want to fight Markdown conversion at 2 a.m.
One `crawl` call with `dataset.type: markdown_bundle`.
Stekpad walks the site, renders every page, converts each to markdown with frontmatter, and lands them in a bundle you can export as a zip or a single concatenated file with a TOC.
- Renders JavaScript (so docs sites with client-side routing work).
- Strips boilerplate (nav, footer, side rail) automatically.
- Keeps the canonical `url`, `title`, `scraped_at`, and `content_hash` in the frontmatter.
- Re-run the same `source_spec` next quarter to refresh.
Crawl 2,000 pages of docs.
```bash
curl -X POST https://api.stekpad.com/v1/crawl \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limits": { "max_pages": 2000, "max_depth": 6 },
    "include_paths": ["/docs/**", "/guides/**", "/reference/**"],
    "exclude_paths": ["/changelog/**"],
    "scrape_options": { "formats": ["markdown"] },
    "dataset": { "type": "markdown_bundle", "name": "Example docs corpus" },
    "webhook_url": "https://myapp.com/hooks/stekpad",
    "webhook_events": ["run.completed"]
  }'
```

When the run completes, export:
```bash
curl "https://api.stekpad.com/v1/datasets/ds_xyz/export?format=zip" \
  -H "Authorization: Bearer stkpd_live_..." \
  -o example-docs.zip
```

Each file inside has frontmatter:
```markdown
---
url: https://docs.example.com/guides/quickstart
title: Quickstart
scraped_at: 2026-04-14T12:34:56Z
content_hash: a1b2c3d4
---
# Quickstart
...
```

Pipe that into LangChain, LlamaIndex, or your own embedding script.
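Because the frontmatter is flat key/value YAML, you can turn the exported bundle into loader-ready documents without a heavy parser. A minimal sketch, assuming the file layout above; the `parse_front_matter` and `load_corpus` helpers are illustrative, not part of the Stekpad API:

```python
from pathlib import Path

def parse_front_matter(text: str) -> tuple[dict, str]:
    """Split a '---'-delimited frontmatter block from the markdown body."""
    meta: dict = {}
    if text.startswith("---\n"):
        header, _, body = text[4:].partition("\n---\n")
        for line in header.splitlines():
            key, _, value = line.partition(": ")
            if key:
                meta[key] = value
        return meta, body.lstrip("\n")
    return meta, text  # no frontmatter: treat the whole file as body

def load_corpus(bundle_dir: str) -> list[dict]:
    """Turn every exported .md file into a dict your embedder can consume."""
    docs = []
    for path in sorted(Path(bundle_dir).rglob("*.md")):
        meta, body = parse_front_matter(path.read_text(encoding="utf-8"))
        docs.append({"text": body, "metadata": meta})
    return docs
```

Each dict carries the page text plus `url`, `title`, `scraped_at`, and `content_hash` as metadata, which is exactly the shape LangChain- or LlamaIndex-style document objects expect.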
About 12 € for a full embedded corpus.
- 2,000 pages crawled = 2,000 credits (~6 € on a Pro pack)
- Optionally run `ai_embed` on every file = +2,000 credits (~6 €)
- Total to ship a full embedded RAG corpus from a 2,000-page docs site: about 12 €.
Re-run from the source_spec.
Re-run the dataset’s `source_spec` from the dashboard or via `POST /v1/datasets/ds_xyz/rerun`. Stekpad re-scrapes the site, bumps `_version` on changed files, and fires `row.changed` webhooks. Your downstream pipeline only re-embeds the diff.
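If you keep the last-seen `content_hash` per URL, deciding what to re-embed after a rerun is a set difference. A minimal sketch, assuming you maintain the URL-to-hash maps yourself; this is your pipeline's bookkeeping, not a documented Stekpad schema:

```python
def rows_to_reembed(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return URLs whose content_hash is new or changed since the last run.

    `previous` and `current` both map canonical URL -> content_hash.
    """
    return sorted(
        url for url, digest in current.items()
        if previous.get(url) != digest  # new page, or hash changed
    )
```

Unchanged pages drop out of the list, so a 2,000-page corpus where only 40 pages changed costs you 40 embedding calls, not 2,000.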
Common questions.
Will the markdown be clean?
Yes. We strip nav/footer/side rail and convert with structural fidelity. Edge cases happen on heavily customized sites; you can post-process the bundle with `ai_clean`.
Can I crawl a docs site behind a login?
Yes — set `use_session: "<domain>"`. The cookie bridge handles the auth.
How do I keep the corpus in sync?
Schedule the crawl on Cloud Starter+. The dataset re-runs nightly. Use `row.changed` webhooks to trigger re-embedding.
Crawl your first docs site.
300 free credits a month. No subscription. Re-runnable next quarter.