Feature · Storage

Storage is the product.

Every scrape lands in a dataset you own. Re-query it tomorrow. Add a column. Export it. Re-run the source.

What it means

The dataset is the unit.

A dataset in Stekpad is a persistent, editable container with one of two shapes:

  • `table` — structured rows with typed columns. Use it for products, companies, jobs, articles, leads.
  • `markdown_bundle` — a collection of markdown files addressable by canonical URL. Use it for docs sites, blog archives, RAG corpora.

Every dataset has a workspace owner, a name and description, a `source_spec` (the scrape/crawl/map config that built it, persisted so you can re-run any time), a soft-deletable archive state, and a retention policy from the workspace plan.
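
As a rough mental model, the fields above could be pictured like this. It is a sketch, not Stekpad's actual schema: the class name `Dataset` and the fields `workspace_id`, `archived`, and `retention_days` are assumptions made for illustration. Only `table`, `markdown_bundle`, and `source_spec` come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Literal, Optional

# Hypothetical model of a dataset record; field names are illustrative only.
DatasetType = Literal["table", "markdown_bundle"]


@dataclass
class Dataset:
    workspace_id: str                      # the workspace that owns the dataset
    name: str
    type: DatasetType                      # one of the two shapes
    description: str = ""
    # The scrape/crawl/map config that built the dataset, kept for re-runs.
    source_spec: dict[str, Any] = field(default_factory=dict)
    archived: bool = False                 # soft delete: a flag, not a removal
    retention_days: Optional[int] = None   # None models unlimited retention

    def archive(self) -> None:
        """Soft-delete: mark as archived, keep the data and the source_spec."""
        self.archived = True


products = Dataset(
    workspace_id="ws_demo",
    name="Example products",
    type="table",
    source_spec={"verb": "scrape", "url": "https://example.com/product/42"},
    retention_days=90,
)
products.archive()
```

Archiving flips a flag and leaves `source_spec` intact, which is what keeps an archived dataset re-runnable in this model.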

Why storage matters

Other APIs return by value.

Other scraping APIs return data by value. You get a JSON blob, you save it somewhere, and you forget the request. Next week you want to add a column, so you re-scrape the whole list and reconcile it with what you saved.

Stekpad gives you the JSON blob and a row in a dataset you can re-query, edit, enrich, and export. The dataset is re-runnable from its `source_spec`. The columns are typed. The rows are versioned. Storage is not a feature. It is the product.

How it works

What rows know about themselves.

Every row in a `table` dataset carries metadata columns:

  • _scraped_at — when it was first written
  • _scraped_version — increments on re-scrape if content changed
  • _changed_at — when the content hash last changed
  • _source_run_id — the run that produced it
  • _content_hash — for change detection

This is what makes change monitoring (Cloud Starter+) possible: re-scrape, compare the hash, fire a `row.changed` webhook.
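
The re-scrape, compare, fire loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Stekpad's implementation: the hashing scheme (SHA-256 over key-sorted JSON), the starting version of 1, and the helper names `content_hash`, `new_row`, and `apply_rescrape` are all hypothetical. The metadata column names are the ones listed above.

```python
import hashlib
import json
from datetime import datetime, timezone


def content_hash(content: dict) -> str:
    """Stable SHA-256 over the row content (assumed scheme: key-sorted JSON)."""
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def new_row(content: dict, run_id: str, now: datetime) -> dict:
    """First write of a row; starting the version at 1 is an assumption."""
    return {
        "content": content,
        "_scraped_at": now,
        "_scraped_version": 1,
        "_changed_at": now,
        "_source_run_id": run_id,
        "_content_hash": content_hash(content),
    }


def apply_rescrape(row: dict, new_content: dict, run_id: str, now: datetime) -> list[str]:
    """Update a stored row in place and return the webhook events to fire."""
    new_hash = content_hash(new_content)
    if new_hash == row["_content_hash"]:
        return []  # same hash: no version bump, no event
    row["content"] = new_content
    row["_content_hash"] = new_hash
    row["_scraped_version"] += 1
    row["_changed_at"] = now
    row["_source_run_id"] = run_id  # assumption: tracks the run that last changed the row
    return ["row.changed"]


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 1, 8, tzinfo=timezone.utc)
row = new_row({"title": "Widget", "price": 19.0}, "run_1", t0)

# Same content in a different key order: the hash matches, nothing fires.
unchanged_events = apply_rescrape(row, {"price": 19.0, "title": "Widget"}, "run_2", t1)

# The price changed: version bumps and a row.changed event is returned.
changed_events = apply_rescrape(row, {"title": "Widget", "price": 17.5}, "run_3", t1)
```

Hashing a canonical form rather than the raw response means cosmetic differences such as key order do not count as a change in this sketch.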

Examples

What you can do with a dataset.

  • Query rows — REST `GET /v1/datasets/:id/rows?filter=...`, MCP `query_dataset`. Free.
  • Add a column manually — paste a value, write a formula, run an enricher.
  • Re-enrich — kick off any of the 19 enrichers on the rows.
  • Re-run the source — replay the original scrape/crawl with one click.
  • Export — CSV, JSON, Markdown bundle zip, Google Sheets live sync.
  • Pipe to a webhook — `row.added`, `row.changed`, `enrichment.completed` events.
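
On the receiving end, the three webhook events can be routed with a small dispatcher. This is a sketch only: the payload shape (`event`, `dataset_id`, `row_id`) is an assumption, since the page names the events but not their bodies, and the handler names are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical payload shape: {"event": ..., "dataset_id": ..., "row_id": ...}.
Handler = Callable[[dict], str]
HANDLERS: Dict[str, Handler] = {}


def on(event: str) -> Callable[[Handler], Handler]:
    """Register a handler for one event name."""
    def register(fn: Handler) -> Handler:
        HANDLERS[event] = fn
        return fn
    return register


@on("row.added")
def handle_added(payload: dict) -> str:
    return f"index new row {payload['row_id']}"


@on("row.changed")
def handle_changed(payload: dict) -> str:
    return f"re-sync row {payload['row_id']}"


@on("enrichment.completed")
def handle_enriched(payload: dict) -> str:
    return f"enrichment done for dataset {payload['dataset_id']}"


def dispatch(payload: dict) -> str:
    """Route a decoded webhook body to its handler; ignore unknown events."""
    handler = HANDLERS.get(payload.get("event", ""))
    return handler(payload) if handler else "ignored"


result = dispatch({"event": "row.changed", "dataset_id": "ds_abc", "row_id": "row_1"})
```

Returning `"ignored"` for unknown event names keeps the receiver from failing if new event types appear later.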

Create, append, query a dataset

```bash
# Create a dataset implicitly by scraping into one
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/42",
    "dataset": { "type": "table", "name": "Example products" }
  }'

# Append more rows to the same dataset
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/43",
    "dataset": { "id": "ds_abc", "mode": "append" }
  }'

# Query the dataset (URL quoted so the shell does not treat "?" as a glob)
curl "https://api.stekpad.com/v1/datasets/ds_abc/rows?limit=10" \
  -H "Authorization: Bearer stkpd_live_..."
```

Retention

By plan.

| Plan | Retention |
| --- | --- |
| Free | 7 days |
| Packs | 30 days |
| Cloud Starter | 90 days |
| Cloud Growth | 1 year |
| Cloud Scale | Unlimited |

Retention is per-dataset, inherited from the workspace plan at the time of creation.
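
To make the inheritance rule concrete, here is a small sketch of how an expiry date could be derived. It rests on two assumptions the page does not state: that the retention clock starts when the dataset is created, and that "1 year" means 365 days. The function name `retention_expiry` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention by plan, in days. None models "Unlimited".
RETENTION_DAYS = {
    "Free": 7,
    "Packs": 30,
    "Cloud Starter": 90,
    "Cloud Growth": 365,  # "1 year", approximated as 365 days
    "Cloud Scale": None,
}


def retention_expiry(plan_at_creation: str, created_at: datetime) -> Optional[datetime]:
    """Expiry for a dataset, from the plan the workspace had when it was created.

    Assumes the clock starts at dataset creation; returns None for unlimited.
    """
    days = RETENTION_DAYS[plan_at_creation]
    if days is None:
        return None
    return created_at + timedelta(days=days)


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
free_expiry = retention_expiry("Free", created)
scale_expiry = retention_expiry("Cloud Scale", created)
```

The function takes the plan at creation time rather than the current plan, which mirrors the rule that a dataset keeps the retention it was created with.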

FAQ

Common questions.

Can I disable storage?

Yes — pass `persist: false` on any verb. The response still contains the data; nothing is stored.

Can I rename a column?

Yes, from the dashboard. The underlying type stays.

Can I convert a `table` to a `markdown_bundle`?

No. The type is immutable. A table can have a `content_markdown` column, which is the path for users who want both.

How are rows deduplicated?

By canonical URL by default. Override with `primary_key: ["sku", "region"]` at dataset creation.
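
The effect of a custom `primary_key` is easiest to see side by side. This is a sketch under assumptions: the column name `canonical_url`, the last-write-wins behaviour on a duplicate, and the helpers `row_key` and `upsert` are illustrative, not Stekpad's implementation.

```python
from typing import Iterable, Optional, Sequence


def row_key(row: dict, primary_key: Optional[Sequence[str]] = None) -> tuple:
    """Dedup key: the primary_key columns if given, else the canonical URL."""
    if primary_key:
        return tuple(row[col] for col in primary_key)
    return (row["canonical_url"],)  # hypothetical column name


def upsert(rows_by_key: dict, incoming: Iterable[dict],
           primary_key: Optional[Sequence[str]] = None) -> dict:
    """Insert rows, replacing any existing row with the same key (assumed last write wins)."""
    for row in incoming:
        rows_by_key[row_key(row, primary_key)] = row
    return rows_by_key


incoming = [
    {"canonical_url": "https://example.com/product/42", "sku": "A1", "region": "eu", "price": 19.0},
    {"canonical_url": "https://example.com/product/42", "sku": "A1", "region": "us", "price": 21.0},
    {"canonical_url": "https://example.com/product/42", "sku": "A1", "region": "us", "price": 20.0},
]

# Default: one URL, so the three scrapes collapse into a single row.
by_url = upsert({}, incoming)

# With primary_key ["sku", "region"]: the EU and US variants stay separate.
by_sku_region = upsert({}, incoming, primary_key=["sku", "region"])
```

A composite key is useful when one page legitimately yields several distinct records, such as regional prices for the same product.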

Every scrape, in a dataset you own.

Sign up free. 300 credits a month. Re-runnable from the source_spec.
