Phase 1: Ingestion
```
┌───────┐
│ files │
└───────┘
    │
    ▼
read_metadata() ─────── Read metadata
    │
    ▼
read_repertoires() ──┬─ Read repertoire files (!)
    │                │        ▼
    │                │   Preprocess
    │                │        ▼
    │                │   Aggregate receptors (!)
    │                │        ▼
    │                │   Postprocess
    │                │        ▼
    │                │   Aggregate repertoires #1
    │                │        ▼
    │                └─ Write data on disk (!)
    ▼
agg_repertoires() ───── Aggregate repertoires #2
    │
    ▼
┌───────────┐
│ ImmunData │
└───────────┘
```
Steps marked with (!) are mandatory; the remaining steps are optional.
The goal of the ingestion phase is to turn a folder of AIRR-seq files into an immutable on-disk `ImmunData` dataset.
- **Read metadata:** `read_metadata()` pulls in any sample- or donor-level information, such as therapy arm, HLA type, or age, and stores it in a data frame that you can pass to the main reading function `read_repertoires()`. Attaching this context early means every chain you read later already "knows" which patient or time point it belongs to. You can safely skip this step if you don't have per-sample or per-donor metadata.
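
  A minimal sketch of this step; the file path and column names are hypothetical, and we assume `read_metadata()` accepts a path to a delimited metadata file:

  ```r
  library(immundata)

  # Hypothetical per-sample metadata table (one row per sample).
  metadata <- read_metadata("data/metadata.tsv")

  # The result is a regular data frame; inspect it before passing it on.
  head(metadata)
  ```
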
- **Read repertoire files:** `read_repertoires()` streams Parquet/CSV/TSV files straight into DuckDB, the engine that powers `ImmunData` objects.
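
  The simplest call, sketched under the assumption that `read_repertoires()` takes a vector of file paths plus the metadata frame from the previous step (the `metadata` argument name is an assumption; check the package reference):

  ```r
  # Hypothetical folder of AIRR-seq files; Parquet, CSV, and TSV all work.
  files <- list.files("data/airr", pattern = "\\.tsv$", full.names = TRUE)

  idata <- read_repertoires(files, metadata = metadata)
  ```
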
- **Preprocess:** During the read step you may pass a `preproc = recipe` argument to `read_repertoires()` to preprocess the data before receptors are aggregated: drop unused columns, strip non-productive sequences, translate field names to the AIRR schema, de-duplicate contigs, etc. Because this logic is declarative, re-runs produce identical results.
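
  A sketch assuming the recipe is an ordered list of functions, each taking and returning the chain table; the helper below is illustrative, not package API:

  ```r
  # Illustrative step: keep only productive chains
  # (`productive` is a standard AIRR schema column).
  keep_productive <- function(tbl) dplyr::filter(tbl, productive)

  idata <- read_repertoires(
    files,
    metadata = metadata,
    preproc  = list(keep_productive)
  )
  ```
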
- **Aggregate receptors:** The receptor schema is how you define a receptor, the logical unit of analysis. `read_repertoires()` collapses chains into receptors accordingly and assigns each one a stable unique identifier.
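
  For example, you might define a receptor as the combination of CDR3 amino-acid sequence and V gene. The argument name `schema` and the column names below are assumptions, so confirm the exact interface in the package reference:

  ```r
  # Hypothetical receptor definition: chains sharing the same CDR3
  # amino-acid sequence and V call collapse into one receptor.
  idata <- read_repertoires(
    files,
    metadata = metadata,
    schema   = c("cdr3_aa", "v_call")
  )
  ```
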
- **Postprocess:** A mirror step to preprocess: a convenient hook to run QC checks, add derived fields, attach reference-gene annotations, or compute per-chain quality metrics after the dataset is ready. You can pass any number of steps, and they will be executed in sequence.
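
  As with `preproc`, a sketch assuming an ordered list of functions; the helpers and the `postproc` argument name are illustrative:

  ```r
  # Illustrative post-aggregation steps, executed in order.
  add_cdr3_length <- function(tbl) dplyr::mutate(tbl, cdr3_len = nchar(cdr3_aa))
  flag_short      <- function(tbl) dplyr::mutate(tbl, too_short = cdr3_len < 6)

  idata <- read_repertoires(
    files,
    preproc  = list(keep_productive),
    postproc = list(add_cdr3_length, flag_short)
  )
  ```
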
- **Aggregate repertoires #1:** If you already know how to group receptors into repertoires, perhaps by the `"Sample"` or `"Donor"` columns from the metadata, you can pass `repertoire_schema = c("Sample")` to `read_repertoires()`. Otherwise, skip this step and define repertoires later (common in single-cell workflows where you need cluster labels first).
- **Write data on disk:** `read_repertoires()` always persists what it just built: column-compressed Parquet parts plus human-readable metadata in JSON. From here on, downstream steps can reopen the dataset instantly without touching the raw AIRR files again.
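
  Reopening the persisted dataset might look like the sketch below; the loader name `read_immundata()` and the output path are assumptions, so confirm them in the package reference:

  ```r
  # Reopen the saved dataset without touching the raw AIRR files.
  idata <- read_immundata("data/immundata")
  ```
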
- **Aggregate repertoires #2:** Call `agg_repertoires()` later if you withheld grouping until additional annotations were available, e.g. donor plus cell cluster.
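
  A sketch assuming `agg_repertoires()` accepts the dataset and a grouping schema that mirrors `repertoire_schema` above; the `schema` argument name and the `"Cluster"` column are assumptions:

  ```r
  # Define repertoires once, e.g., single-cell clustering has added a
  # "Cluster" annotation alongside the "Donor" metadata column.
  idata <- agg_repertoires(idata, schema = c("Donor", "Cluster"))
  ```
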