Storage & Provenance
When a pipeline runs, two very different kinds of records are produced: the output bytes your methods write, and the provenance trail that says which method, parameters, and inputs produced them. Workflow Canvas stores these in two separate places, and understanding the split makes everything else about results, caching, and sharing fall into place.
The short version:
Output bytes live in one place only: a content-addressed cache keyed by the hash of the data itself. They are never referenced by a folder path you would browse.
The provenance trail lives in a SQLite database (.wfc/wfc.db). It records every run and maps each output to the content hash that holds its bytes.
Git tracks your method source (so a method’s code is versioned), but it never records pipeline runs.
This page explains why outputs are addressed by content instead of by path, what git does and does not capture, how the cache is shared across machines, and what you must back up to keep results recoverable. For the operational “how do I find a run’s output” recipe, see [[run-and-inspect-results]]; for why a step re-runs or hits the cache, see [[caching-and-reproducibility]].
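To make "addressed by content instead of by path" concrete, here is a minimal sketch of a content-addressed store. It is an illustration only, not Workflow Canvas's implementation: the SHA-256 hash, the two-character fan-out directory layout, and the function names `store_blob` and `load_blob` are all assumptions chosen for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def store_blob(cache_dir: Path, data: bytes) -> str:
    """Write data under a path derived from its SHA-256 digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    # Hypothetical layout: a two-character fan-out directory, then the rest of the hash.
    blob_path = cache_dir / digest[:2] / digest[2:]
    if not blob_path.exists():  # identical content is stored only once
        blob_path.parent.mkdir(parents=True, exist_ok=True)
        blob_path.write_bytes(data)
    return digest


def load_blob(cache_dir: Path, digest: str) -> bytes:
    """Read the bytes back using only the content hash."""
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = store_blob(cache, b"result table v1")
    second = store_blob(cache, b"result table v1")  # same bytes, same key
    other = store_blob(cache, b"result table v2")   # different bytes, different key
    blob_count = sum(1 for p in cache.rglob("*") if p.is_file())
    roundtrip = load_blob(cache, first)
```

Because the key is derived from the bytes, writing the same output twice yields one blob, and nothing in the cache says which run produced it. That second property is why a separate provenance record is needed.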
Archiving is deferred, and indexed by the database
Hashing a file and moving it into the cache is called archiving, and it does not happen inline between pipeline steps. While a pipeline runs, each step records its output row with content_hash = NULL and leaves the bytes in staging. After the whole pipeline finishes, a single archive pass hashes every un-archived output and moves it into the cache in one batch. This keeps slow hashing I/O off the critical path between steps, which matters most for pipelines that produce large outputs.
Archiving runs automatically when you pass the archive option to a pipeline run, and you can also trigger it on demand with wfc cache archive, which finds every output still carrying a NULL hash and archives it.
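The deferred pass described above can be sketched as a single query for rows whose hash is still NULL, followed by a hash-and-move for each. This is a sketch under stated assumptions, not the actual code behind wfc cache archive: the `outputs` table, its `staging_path` column, the flat cache layout, and the use of SHA-256 are hypothetical. Only the `content_hash = NULL` convention comes from the text.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def archive_pending(conn: sqlite3.Connection, cache_dir: Path) -> int:
    """Hash every output whose content_hash is NULL, move it into the cache, record the hash."""
    pending = conn.execute(
        "SELECT id, staging_path FROM outputs WHERE content_hash IS NULL"
    ).fetchall()
    for output_id, staging_path in pending:
        src = Path(staging_path)
        # Reads the whole file for brevity; a real pass would hash large files in chunks.
        digest = hashlib.sha256(src.read_bytes()).hexdigest()
        dest = cache_dir / digest
        if dest.exists():
            src.unlink()       # identical bytes are already cached
        else:
            src.replace(dest)  # move out of staging into the cache
        conn.execute(
            "UPDATE outputs SET content_hash = ? WHERE id = ?", (digest, output_id)
        )
    conn.commit()
    return len(pending)


with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "staging"
    cache = Path(tmp) / "cache"
    staging.mkdir()
    cache.mkdir()
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE outputs (id INTEGER PRIMARY KEY, staging_path TEXT, content_hash TEXT)"
    )
    # While the pipeline runs, each step only records a row with a NULL hash.
    for name, payload in [("a.bin", b"alpha"), ("b.bin", b"beta")]:
        path = staging / name
        path.write_bytes(payload)
        conn.execute("INSERT INTO outputs (staging_path) VALUES (?)", (str(path),))

    archived_first = archive_pending(conn, cache)   # the batch pass after the run
    archived_again = archive_pending(conn, cache)   # nothing left to archive
    remaining_null = conn.execute(
        "SELECT COUNT(*) FROM outputs WHERE content_hash IS NULL"
    ).fetchone()[0]
    cached_files = len(list(cache.iterdir()))
    staged_files = len(list(staging.iterdir()))
    conn.close()
```

Running the pass a second time is a no-op, which is what makes an on-demand trigger safe to call at any point: it only ever touches rows that still carry a NULL hash.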
One caveat matters here: the cache is a flat pile of content-addressed blobs with no human-readable names. The only thing that maps a meaningful run and output back to the right blob is the SQLite database at .wfc/wfc.db, and that database is not tracked in git. If you delete or lose .wfc/, the blobs in the cache become anonymous: the bytes are still on disk, but nothing records which run or output they belong to, so they are effectively unrecoverable. Backing up .wfc/ is therefore required to keep archived outputs usable. The cache pruning command refuses to remove blobs for runs whose outputs have not yet been archived, so you cannot accidentally prune away data that exists only in staging, but it cannot protect you from losing the database index itself.
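One way to back up the database file itself is SQLite's online backup API, which Python exposes as `sqlite3.Connection.backup`. The sketch below is a generic SQLite technique, not a Workflow Canvas command. The `backup_database` helper, the backup file name, and the one-table schema are hypothetical, and any other files under .wfc/ would still need an ordinary copy.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_database(db_path: Path, backup_path: Path) -> None:
    """Copy a SQLite database with the online backup API, which is safe while the source is in use."""
    source = sqlite3.connect(db_path)
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)  # copies every page of the source into the target
    finally:
        target.close()
        source.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "wfc.db"  # stands in for .wfc/wfc.db
    conn = sqlite3.connect(db_path)
    # Hypothetical schema, for illustration only.
    conn.execute("CREATE TABLE outputs (id INTEGER PRIMARY KEY, content_hash TEXT)")
    conn.execute("INSERT INTO outputs (content_hash) VALUES ('abc123')")
    conn.commit()
    conn.close()

    backup_path = Path(tmp) / "wfc-backup.db"
    backup_database(db_path, backup_path)

    # The backup is a complete, independently readable database.
    check = sqlite3.connect(backup_path)
    restored = check.execute("SELECT content_hash FROM outputs").fetchall()
    check.close()
```

Copying through the backup API rather than copying the file directly avoids capturing a half-written database if a run is updating it at the same moment.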
What git tracks (and what it doesn’t)
Git’s role here is narrow and deliberate. When you register a method, Workflow Canvas snapshots that method’s source files into the project and auto-commits that snapshot to git. Your method code is therefore versioned, and the commit SHA is captured as audit metadata on the method version row. (Cache validity itself is driven by a content fingerprint of the source files, not by the commit, so an unrelated commit elsewhere in the repo does not invalidate cached results — see [[caching-and-reproducibility]].)
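The difference between a content fingerprint and a commit SHA can be illustrated with a short sketch. The actual fingerprint algorithm is not specified here, so everything below is an assumption: the `source_fingerprint` function, the use of SHA-256, and the choice to hash relative paths together with file contents.

```python
import hashlib
import tempfile
from pathlib import Path


def source_fingerprint(source_dir: Path) -> str:
    """Hash the relative path and contents of every file under source_dir, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(source_dir).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so different file boundaries cannot collide.
        digest.update(len(rel).to_bytes(8, "big") + rel)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    method_dir = Path(tmp) / "methods" / "normalize"  # hypothetical method snapshot
    method_dir.mkdir(parents=True)
    (method_dir / "main.py").write_text("def run(x):\n    return x / 2\n")
    before = source_fingerprint(method_dir)

    # An unrelated change elsewhere in the repository leaves the fingerprint unchanged.
    (Path(tmp) / "README.md").write_text("notes\n")
    after_unrelated = source_fingerprint(method_dir)

    # Editing the method's own source changes it.
    (method_dir / "main.py").write_text("def run(x):\n    return x / 3\n")
    after_edit = source_fingerprint(method_dir)
```

A commit SHA would change in both cases, since each edit would land in a new commit. A fingerprint scoped to the method's own files changes only in the second case, which is the property that keeps cached results valid across unrelated commits.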
What git does not do: pipeline runs never produce git commits. Running a pipeline does not stage, commit, or touch your working tree. There is intentionally no --allow-dirty style escape hatch layered on top of runs — the commit-then-run discipline applies to registering methods, not to executing pipelines.
So the division of labor is: git versions method source, the SQLite database (.wfc/wfc.db) is the run and provenance record, and the cache holds output bytes. ADR-007 established this split — git for code, a database for provenance, content-addressed storage for data.
Where to go next
[[run-and-inspect-results]] — the operational recipe: find a run, read its outputs, and trace lineage.
[[caching-and-reproducibility]] — why a step re-runs versus hits the cache, and how the lineage chain is recorded.
[[project-anatomy]] — the .wfc/ layout and the wf-canvas.toml config, including the [dvc] remote block.
[[registration]] — how registering a method snapshots and commits its source.