Feature · Storage

Storage is the product.

Every scrape lands in a dataset you own. Re-query it tomorrow. Add a column. Export it. Re-run the source.

What it means

The dataset is the unit.

A dataset in Stekpad is a persistent, editable container with one of two shapes:

`table` — structured rows with typed columns. Use it for products, companies, jobs, articles, leads.
`markdown_bundle` — a collection of markdown files addressable by canonical URL. Use it for docs sites, blog archives, RAG corpora.

Every dataset has a workspace owner, a name and description, a `source_spec` (the scrape/crawl/map config that built it, persisted so you can re-run any time), a soft-deletable archive state, and a retention policy from the workspace plan.

Why storage matters

Other APIs return by value.

Other scraping APIs return data by value. You get a JSON blob, you save it somewhere, you forget the request. Next week you want to add a column, so you re-scrape the whole list and reconcile the results.

Stekpad gives you the JSON blob and a row in a dataset you can re-query, edit, enrich, and export. The dataset is re-runnable from its source_spec. The columns are typed. The rows have versions. Storage is not a feature — it is the product.

How it works

What rows know about themselves.

Every row in a `table` dataset carries metadata columns:

_scraped_at — when it was first written
_scraped_version — increments on re-scrape if content changed
_changed_at — when the content hash last changed
_source_run_id — the run that produced it
_content_hash — for change detection

This is what makes change monitoring (Cloud Starter and above) possible: re-scrape the source, compare the new content hash with the stored one, and fire a `row.changed` webhook when they differ.
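The compare-hash step can be sketched locally in a few lines of shell. This is an illustration only: the algorithm behind `_content_hash` is not documented here, so SHA-256 over the raw content stands in for it, and `content_hash` and `detect_change` are hypothetical helper names, not part of the Stekpad API. It assumes GNU coreutils (`sha256sum`).

```bash
#!/usr/bin/env bash
# Sketch of hash-based change detection.
# Assumption: SHA-256 stands in for whatever _content_hash really uses.

# Print the SHA-256 hex digest of a string.
content_hash() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Compare a stored hash with freshly scraped content.
# Prints "changed" when the content differs, otherwise "unchanged".
detect_change() {
  local stored_hash="$1" new_content="$2"
  if [ "$(content_hash "$new_content")" = "$stored_hash" ]; then
    echo "unchanged"
  else
    echo "changed"
  fi
}

stored="$(content_hash 'Price: 19.99')"
detect_change "$stored" 'Price: 19.99'   # same content: no new version
detect_change "$stored" 'Price: 17.49'   # different content: a row.changed event would fire
```

In this model, a "changed" result is the point at which `_scraped_version` would increment and `_changed_at` would be updated.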

Examples

What you can do with a dataset.

Query rows — REST `GET /v1/datasets/:id/rows?filter=...`, MCP `query_dataset`. Free.
Add a column manually — paste a value, write a formula, run an enricher.
Re-enrich — kick off any of the 19 enrichers on the rows.
Re-run the source — replay the original scrape/crawl with one click.
Export — CSV, JSON, Markdown bundle zip, Google Sheets live sync.
Pipe to a webhook — `row.added`, `row.changed`, `enrichment.completed` events.

Create, append, query a dataset

```bash
# Create a dataset implicitly by scraping into one
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/42",
    "dataset": { "type": "table", "name": "Example products" }
  }'

# Append more rows to the same dataset (ds_abc is a placeholder dataset id)
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/43",
    "dataset": { "id": "ds_abc", "mode": "append" }
  }'

# Query the dataset (URL quoted so the shell does not treat "?" as a glob)
curl "https://api.stekpad.com/v1/datasets/ds_abc/rows?limit=10" \
  -H "Authorization: Bearer stkpd_live_..."
```

Retention

By plan.

| Plan | Retention |
| --- | --- |
| Free | 7 days |
| Packs | 30 days |
| Cloud Starter | 90 days |
| Cloud Growth | 1 year |
| Cloud Scale | Unlimited |

Retention is per-dataset, inherited from the workspace plan at the time of creation.
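To make the table concrete, here is a rough sketch that turns a plan name and a creation date into an expiry date. Several things are assumptions rather than documented behavior: that the retention clock starts at dataset creation, that "1 year" means 365 days, and the helper names `retention_days` and `expiry_date`, which are hypothetical. It relies on GNU `date`.

```bash
#!/usr/bin/env bash
# Sketch only: retention is assumed to count from the dataset's creation date.

# Map a plan name to retention days, per the table above.
# Prints an empty string for unlimited retention.
retention_days() {
  case "$1" in
    "Free") echo 7 ;;
    "Packs") echo 30 ;;
    "Cloud Starter") echo 90 ;;
    "Cloud Growth") echo 365 ;;   # assumption: "1 year" treated as 365 days
    "Cloud Scale") echo "" ;;
    *) return 1 ;;
  esac
}

# Print the expiry date (YYYY-MM-DD) for plan $1 and creation date $2,
# or "never" when retention is unlimited.
expiry_date() {
  local days
  days="$(retention_days "$1")" || { echo "unknown plan: $1" >&2; return 1; }
  if [ -z "$days" ]; then
    echo "never"
  else
    date -u -d "$2 +$days days" +%F
  fi
}

expiry_date "Free" 2025-01-01
expiry_date "Cloud Scale" 2025-01-01
```

Because retention is inherited at creation, the plan passed in should be the one the workspace had when the dataset was created, not its current plan.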

FAQ

Common questions.

Can I disable storage?

Yes — pass `persist: false` on any verb. The response still contains the data; nothing is stored.
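A one-off scrape without storage might look like the sketch below. The position of `persist` in the request body is an assumption (shown here at the top level, next to `url`), and `scrape_without_storage` is a hypothetical wrapper name; the function is defined but not called, so nothing is sent until you invoke it.

```bash
#!/usr/bin/env bash
# Assumption: `persist` sits at the top level of the request body.
payload='{
  "url": "https://example.com/product/42",
  "persist": false
}'

# Send the scrape; the data comes back in the response and is not stored.
scrape_without_storage() {
  curl -X POST https://api.stekpad.com/v1/scrape \
    -H "Authorization: Bearer stkpd_live_..." \
    -H "Content-Type: application/json" \
    -d "$payload"
}
```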

Can I rename a column?

Yes, from the dashboard. The underlying type stays.

Can I convert a `table` to a `markdown_bundle`?

No — type is immutable. A table can have a content_markdown column, which is the path for users who want both.

How are rows deduplicated?

By canonical URL by default. Override with `primary_key: ["sku", "region"]` at dataset creation.
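The effect of a composite key such as `["sku", "region"]` can be illustrated offline with `awk`. This is a sketch under stated assumptions: rows are simple comma-separated lines with `sku` and `region` as the first two fields (no quoted commas), a later row with the same key replaces the earlier one, and `dedupe_rows` is a hypothetical helper. How Stekpad actually resolves key collisions is not specified here.

```bash
#!/usr/bin/env bash
# Keep one row per primary key (fields 1 and 2: sku, region).
# Assumption: later rows replace earlier ones; output keeps first-seen key order.
dedupe_rows() {
  awk -F',' '
    {
      key = $1 FS $2
      if (!(key in latest)) order[++n] = key   # remember first appearance
      latest[key] = $0                         # last write wins
    }
    END { for (i = 1; i <= n; i++) print latest[order[i]] }
  '
}

printf '%s\n' \
  'A1,eu,19.99' \
  'A1,us,21.99' \
  'A1,eu,17.49' | dedupe_rows
```

The two `A1,eu` rows collapse into one carrying the newer price, while `A1,us` survives as a separate row because its region differs.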

Every scrape, in a dataset you own.

