Getting Started with Workflow Canvas

What is Workflow Canvas?

Workflow Canvas is a system for building reproducible computational pipelines, driven by its command-line tool, wfc. It is designed for multi-step analysis pipelines and provides:

  • Contracts — Methods declare their inputs, outputs, and parameters via method.yaml. Modules enforce required outputs. The system catches wiring errors and missing outputs before you waste time on a long run.

  • Caching — Each step is fingerprinted by its source code, parameters, and upstream inputs. Unchanged steps are skipped automatically. The system refuses to run on uncommitted code (DirtyRepositoryError) to ensure reproducibility.

  • Lineage — Every run is recorded in a SQLite database. You can trace any output back through its full DAG ancestry, including cache hits, which appear as audit rows in the lineage.

  • Snakemake execution — Pipelines are defined as JSON and compiled to Snakefiles. Snakemake handles parallelism and dependency resolution.

  • Canvas UI — A visual interface for building and inspecting pipelines (separate feature).
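
The caching bullet above describes fingerprinting a step by its source code, parameters, and upstream inputs. A minimal sketch of that idea follows. The function name step_fingerprint, the use of SHA-256, and the canonical-JSON encoding are all assumptions made for illustration; this guide does not show wfc's actual fingerprint format.

```python
import hashlib
import json


def step_fingerprint(source_code: str, params: dict, upstream: dict[str, str]) -> str:
    """Hash everything that can change a step's result into one hex digest.

    Hypothetical helper, not part of wfc. `upstream` maps an input slot
    name to the fingerprint of the step that feeds it.
    """
    payload = {
        "source": hashlib.sha256(source_code.encode("utf-8")).hexdigest(),
        "params": params,
        "upstream": upstream,
    }
    # sort_keys makes the digest independent of dict insertion order
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = step_fingerprint("print(1)", {"column": "condition", "values": ["control"]}, {})
b = step_fingerprint("print(1)", {"values": ["control"], "column": "condition"}, {})
c = step_fingerprint("print(1)", {"column": "condition", "values": ["treated"]}, {})
print(a == b)  # same inputs, different key order: would be a cache hit
print(a != c)  # changed parameter: the step would run again
```

Because each fingerprint includes the fingerprints of its upstream steps, a change early in the pipeline changes every fingerprint downstream of it, while untouched branches keep theirs.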

wfc manages the full lifecycle: register your modules, methods, and data samples, define a pipeline, run it, and query lineage — all from the command line.
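
To make "trace any output back through its full DAG ancestry" concrete, here is a small sketch using Python's built-in sqlite3 module and a recursive query. The runs and edges tables, their columns, and the ancestors function are hypothetical; the real wfc schema is not described in this guide and will differ.

```python
import sqlite3

# Hypothetical schema: one row per run, one row per parent -> child edge.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (id TEXT PRIMARY KEY, method TEXT, cache_hit INTEGER);
CREATE TABLE edges (parent TEXT, child TEXT);
INSERT INTO runs VALUES ('r1', 'load', 0), ('r2', 'filter', 1), ('r3', 'analyze', 0);
INSERT INTO edges VALUES ('r1', 'r2'), ('r2', 'r3');
""")


def ancestors(conn: sqlite3.Connection, run_id: str) -> list[tuple]:
    """Return every run upstream of run_id, direct or indirect."""
    return conn.execute(
        """
        WITH RECURSIVE up(id) AS (
            SELECT parent FROM edges WHERE child = ?
            UNION
            SELECT e.parent FROM edges e JOIN up ON e.child = up.id
        )
        SELECT r.id, r.method, r.cache_hit
        FROM runs r JOIN up ON r.id = up.id
        ORDER BY r.id
        """,
        (run_id,),
    ).fetchall()


# The cache hit (r2) still shows up as a row in the ancestry.
print(ancestors(conn, "r3"))
```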

Prerequisites

Before you begin, make sure you have:

  • Python 3.11+ — Required. The config parser uses tomllib from the standard library (added in 3.11). You install wfc with pip, which ships with Python.

  • Docker — Required. Methods always run inside a container, so a working Docker installation is a hard requirement: nothing runs without it. On Windows and macOS this means Docker Desktop; on Linux, the Docker Engine and its daemon. wfc init pre-flights Docker for you and wfc doctor checks it any time, but neither can install it — that part is on you.

  • Git — Required, but only locally. wfc records a commit for every run so results are reproducible, and it refuses to run on uncommitted code. All you need is the git command and a local identity (a name and email); wfc init even sets a repo-local fallback identity if you have none configured. There is no GitHub account, no login, and no network involved — wfc never pushes anywhere. The git requirement is purely about a clean local history.
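
If you want a quick check of the three prerequisites above before installing anything, a sketch like the following may help. It is not part of wfc and is much shallower than wfc doctor: it only confirms that the docker and git executables are on PATH, not that the Docker daemon is running. The function name check_prerequisites is made up for this example.

```python
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report whether each prerequisite appears to be present."""
    return {
        # tomllib entered the standard library in Python 3.11
        "python>=3.11": sys.version_info >= (3, 11),
        # shutil.which only finds the executable; it does not contact the daemon
        "docker on PATH": shutil.which("docker") is not None,
        "git on PATH": shutil.which("git") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```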

A few things you do not need to install separately:

  • DVC ships with wfc — it is a dependency, installed automatically when you pip install workflow-canvas. wfc uses it as the content-addressed store for your run outputs and registered samples. wfc init configures a local archive for you; you never have to set DVC up by hand.

  • Snakemake also ships with wfc and runs your pipelines under the hood — wfc generates the Snakefile and invokes it for you.

Container environments are built ahead of time from a backend such as a base image, pixi, or conda, but those tools belong to the environment-build step, not to running wfc itself. See [[registering-an-environment]] for how environments are built and named.

Installation

# Install Workflow Canvas (includes the wfc CLI, plus DVC and Snakemake)
pip install workflow-canvas

# Verify the installation
wfc --help

You should see the wfc CLI help output listing available commands, including init, doctor, register-module, register-method, register-sample, register-env, and run-pipeline.

Your First Pipeline

This walkthrough covers the happy path from project initialization to pipeline execution.

1. Initialize a project

wfc init --dir ./my_project
cd my_project

wfc init is a guided setup wizard that leaves you with a project that can actually run. Run it with no extra flags and it walks you through setup interactively; the goal is that when it finishes you can register a method and run a pipeline without any further hand-configuration. It does four things:

  • Scaffolds the project structure: .wfc/ (config + database), modules/, methods/, data/samples/, .runs/, and a .gitignore. The modules/ and methods/ directories start empty; you register your own in the next steps.

  • Configures a backup archive for your outputs. Every project gets one — the wizard only asks where it should live, with a sensible default you can accept by pressing Enter (~/.wfc/archives/<project>, kept outside the repo so it survives). This is wired up as a live DVC archive; you do not configure DVC yourself.

  • Sets up git. If the directory is not already a git repository, the wizard runs git init and makes a clean initial commit of the scaffold. That gives the run-gate a real starting commit and a clean tree, so your first run is not blocked by a missing HEAD or a “dirty repository” error. If you have no git identity configured, it sets a repo-local one for you so the commit always lands — you can change it later with git config.

  • Pre-flights Docker and prints a health summary at the end, so you immediately know whether your project is ready to run or what is still missing.

The wizard is idempotent: it is always safe to re-run. It never re-asks questions you have already answered and never clobbers existing config — each step checks “does this already exist?” first. That makes recovery simple: if a tool was missing, install it, re-run wfc init to finish only what is left, and run wfc doctor to confirm.
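
The "does this already exist?" pattern described above can be sketched in a few lines. This is an illustration of the idempotency idea only, not wfc's code: the scaffold function is hypothetical, and config.toml is an assumed file name chosen because the config parser uses tomllib.

```python
import tempfile
from pathlib import Path

SCAFFOLD_DIRS = [".wfc", "modules", "methods", "data/samples", ".runs"]


def scaffold(root: Path, default_config: str = "# placeholder config\n") -> list[str]:
    """Create only what is missing; return what this call created."""
    created = []
    for rel in SCAFFOLD_DIRS:
        path = root / rel
        if not path.exists():
            path.mkdir(parents=True)
            created.append(rel)
    config = root / ".wfc" / "config.toml"  # hypothetical file name
    if not config.exists():  # never overwrite an existing config
        config.write_text(default_config)
        created.append(".wfc/config.toml")
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = scaffold(root)
    (root / ".wfc" / "config.toml").write_text("edited = true\n")
    second = scaffold(root)
    kept = (root / ".wfc" / "config.toml").read_text()

print(first)   # everything is created on the first call
print(second)  # nothing is left to do on the second call
print(kept)    # the user's edit is untouched
```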

For scripts and CI, run it non-interactively:

wfc init --dir ./my_project --yes              # accept all defaults, no prompts
wfc init --dir ./my_project --archive /data/archives/my_project --yes

--yes accepts every default (including the git-identity fallback), and --archive PATH sets the output archive location without prompting.

One door for “why won’t this run?” If a run ever refuses to start, run wfc doctor. It checks git, the output archive, and Docker, prints a health table, and exits non-zero if anything is broken — handy both at your terminal and as a CI gate. When a run is blocked, wfc points you at wfc doctor rather than dumping a raw error.
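
Because wfc doctor exits non-zero when something is broken, a CI script can gate on its exit code. The sketch below shows one way to do that from Python. The gate function is a hypothetical helper, and the demo uses stand-in commands so it runs on a machine without wfc installed; in a real CI job you would call gate(["wfc", "doctor"]).

```python
import subprocess
import sys


def gate(command: list[str]) -> bool:
    """Run a health-check command; True means it exited with code 0."""
    try:
        result = subprocess.run(command, check=False)
    except FileNotFoundError:
        # The executable itself is missing, which also counts as unhealthy
        return False
    return result.returncode == 0


# Stand-ins for a passing and a failing health check
healthy = gate([sys.executable, "-c", "raise SystemExit(0)"])
broken = gate([sys.executable, "-c", "raise SystemExit(1)"])
print(healthy, broken)
```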

About the archive. The archive stores your outputs as content-addressed blobs, indexed by the database in .wfc/ (which is deliberately not tracked in git). To keep archived outputs recoverable, back up your .wfc/ directory along with the archive folder.

Tip: Run wfc seed to populate the project with demo modules, methods, and sample data for experimentation.

2. Register a module

Modules group related methods under a domain name with output contracts. You can define contracts in a module.yaml file or pass them via CLI:

# From module.yaml (recommended):
wfc register-module --name cell_analysis --module-dir modules/cell_analysis

# Or from CLI JSON:
wfc register-module --name cell_analysis \
  --contracts '[{"type": "output", "name": "result", "value_type": ".csv", "required": true}]'
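
As a rough picture of what enforcing an output contract could look like, the sketch below parses the same contract JSON used in the command above and checks a method's declared outputs against it. The missing_required_outputs function and the shape of method_outputs (output name mapped to value type) are assumptions made for illustration; wfc's own validation logic is not shown in this guide.

```python
import json

# The contract list passed to --contracts in the command above
contracts = json.loads(
    '[{"type": "output", "name": "result", "value_type": ".csv", "required": true}]'
)


def missing_required_outputs(contracts: list[dict], method_outputs: dict[str, str]) -> list[str]:
    """Names of required module outputs the method fails to declare.

    An output counts as declared only if both name and value_type match.
    """
    problems = []
    for contract in contracts:
        if contract.get("type") != "output" or not contract.get("required", False):
            continue
        if method_outputs.get(contract["name"]) != contract["value_type"]:
            problems.append(contract["name"])
    return problems


print(missing_required_outputs(contracts, {"result": ".csv"}))   # contract satisfied
print(missing_required_outputs(contracts, {"summary": ".csv"}))  # "result" is missing
```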

3. Register a method

Methods are individual analysis scripts. Registration AST-scans the script for public functions, parses method.yaml for contracts, validates environment resolution, checks outputs against module contracts, and git-commits the source:

# Nested method under a module:
wfc register-method modules/cell_analysis/preprocess --module cell_analysis

# Flat standalone method:
wfc register-method methods/aggregate --module my_module
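
To illustrate the AST scan mentioned above, here is a minimal sketch using Python's standard ast module. It treats any top-level function whose name does not start with an underscore as public. That rule and the public_functions name are assumptions; the exact criteria wfc applies are not specified in this guide.

```python
import ast


def public_functions(source: str) -> list[str]:
    """List top-level function names that do not start with an underscore.

    The source is parsed, never executed, so scanning is safe even for
    scripts whose dependencies are not installed on the host.
    """
    tree = ast.parse(source)
    return [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.name.startswith("_")
    ]


example = '''
def preprocess(data, threshold=0.5):
    return data

def _helper():
    pass
'''
print(public_functions(example))
```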

A method declares the container environment it runs in (via env: in its method.yaml), and that environment must be built and registered with wfc register-env before the method can run. See [[registering-an-environment]] for the details.

Two Pythons, by design. The wfc engine runs in its own environment on the host machine; your method runs in its own container environment, which contains only your declared dependencies (and, optionally, the pure-stdlib wfc-client package). wfc never imports your method’s code and your method never imports the wfc engine — they communicate only through WFC_* environment variables and files in the run directory. Because that contract is plain env vars and files, any recorded run can be reproduced later regardless of which client version (or none) the method was authored with.

4. Register a sample

Samples are your input data. Registration content-hashes the file and stores it in the DVC cache. The data/samples/ directory is an ephemeral workspace — files are restored lazily by Snakemake at execution time, not copied at registration:

wfc register-sample --name CFPAC_ERKi --source /data/raw/cfpac_erki.csv
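
Content hashing means a sample is identified by its bytes rather than its file name or location. The sketch below shows the general idea with SHA-256 from the standard library. The content_hash function and the choice of algorithm are assumptions for illustration; the hashing DVC performs internally is not described in this guide.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large samples never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "sample.csv"
    renamed_copy = Path(tmp) / "copy_with_other_name.csv"
    original.write_bytes(b"condition,value\ncontrol,1\n")
    renamed_copy.write_bytes(b"condition,value\ncontrol,1\n")
    same = content_hash(original) == content_hash(renamed_copy)

print(same)  # identical bytes give the same address, whatever the file is called
```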

5. Define and run a pipeline

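Pipelines are JSON files with nodes, links, and samples. Each node references a registered method; links wire outputs to named input slots, and both ends of a link must name a node in the nodes list. The method, module, and slot names below are illustrative: the analyze node reuses the aggregate method registered in step 3 and is assumed to declare an input slot named data. Substitute the names of methods you have registered:

{
  "nodes": [
    {"id": "filter_ctrl", "method": "csv_filter", "module": "csv_tools",
     "params": {"column": "condition", "values": ["control"]}},
    {"id": "analyze", "method": "aggregate", "module": "my_module"}
  ],
  "links": [
    {"source": "filter_ctrl", "target": "analyze", "target_slot": "data"}
  ],
  "samples": ["CFPAC_ERKi"]
}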

Run it:

wfc run-pipeline --pipeline pipeline.json --cores 4

This parses and validates the pipeline (cycle detection, slot wiring), generates a Snakefile, and executes via Snakemake. Each step checks git state, checks the cache (skipping the step on a hit), runs in its container if needed, archives output, and records the run in the database. If a run won’t start, wfc doctor will tell you why.
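
The validation step mentioned above (cycle detection and wiring checks) can be pictured with a short sketch. It uses Kahn's topological sort, a standard way to detect cycles in a directed graph, and rejects links that point at unknown nodes. The validate_pipeline function is hypothetical and the node names are made up; wfc's real validator is not shown in this guide and likely checks more, such as slot names against method contracts.

```python
from collections import deque


def validate_pipeline(pipeline: dict) -> list[str]:
    """Return node ids in a runnable order, or raise ValueError."""
    ids = [node["id"] for node in pipeline["nodes"]]
    indegree = {node_id: 0 for node_id in ids}
    children = {node_id: [] for node_id in ids}
    for link in pipeline.get("links", []):
        for end in ("source", "target"):
            if link[end] not in indegree:
                raise ValueError(f"link references unknown node: {link[end]}")
        children[link["source"]].append(link["target"])
        indegree[link["target"]] += 1

    # Kahn's algorithm: repeatedly emit nodes with no unprocessed parents
    queue = deque(node_id for node_id in ids if indegree[node_id] == 0)
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for child in children[current]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(ids):
        raise ValueError("pipeline contains a cycle")
    return order


demo = {
    "nodes": [{"id": "step_b"}, {"id": "step_a"}],
    "links": [{"source": "step_a", "target": "step_b", "target_slot": "data"}],
}
print(validate_pipeline(demo))
```

The returned order is also a valid execution order, which is the same dependency information a workflow engine needs to schedule independent steps in parallel.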

Next Steps

Now that you have a working pipeline, explore further:

  • [[registering-an-environment]] — Build and register the container environment your methods run in. Every method needs one, and wfc doctor checks that Docker is ready for it.

  • [[authoring-a-method-script]] — Write methods with the wfc-client decorator (recommended) or the canonical WFC_* env-var + file contract, and declare contracts in method.yaml.

  • [[writing-contracts]] — Declare and enforce the inputs and outputs that wire your pipeline together correctly.

  • [[project-anatomy]] — Understand the directory structure, the database, the config file, and how modules, methods, and runs are organized.

  • [[canvas]] — Build and inspect pipelines visually instead of by hand-editing JSON.

  • [[run-and-inspect-results]] — Find a run’s outputs and trace its lineage after it completes.