# immundata

> A unified data layer for large-scale single-cell, spatial and bulk immunomics in R

`immundata` introduces the `ImmunData` data structure — think AnnData or SeuratObject, but for immune repertoires — so you can:

- Store tens of millions of immune receptors plus metadata in one place;
- Compute receptor- and repertoire-level statistics leveraging single-cell, spatial, immunogenicity or any other annotations;
- Work seamlessly with datasets that don't fit in memory;
- Run the same workflow on a laptop, server, or cloud instance.
## Why `immundata`?
Modern immunomics no longer ends at a couple of FASTQ files and a bar plot:

- We now blend bulk AIRR-seq, single-cell V(D)J + GEX, spatial transcriptomics, clinical metadata and public databases — often inside the same analysis notebook;
- Pipelines that handle gigabytes today face deca-gigabytes after the next experiment;
- The same immune repertoire dataset must power multiple plots, dashboards and deep learning models, and stay reproducible months (ideally years) later.
`immundata` is the data-engineering backbone, powered by Arrow, DuckDB, and duckplyr. It lets you scale, mix and, ultimately, analyse annotated AIRR data without rewriting your biology workflow from scratch each time the dataset grows 10×.
> [!IMPORTANT]
> This README is huge. I'm not kidding. Please consider using navigation.
## How stable is `immundata`?
> [!WARNING]
> `immundata` is still in the 0.x series. Until we reach 1.0.0, breaking changes may appear in any minor/patch update (e.g. 0.2.1 → 0.3.0). When you attach the package, you'll sometimes see startup messages summarising the most important changes — please read them. If something that used to work suddenly fails, check the updated documentation (`?function_name`) first.
>
> Tip: if your analysis depends on a specific behaviour, pin the exact version with `renv` or use `pak` for installation:
>
> ```r
> pak::pkg_install("immunomind/immundata@0.2.1")
> ```
>
> I'll keep publishing tagged releases with full docs so you can always roll back if needed.
Before installing any release or pre-release version of `immundata`, please install `pak`, which simplifies the installation of any package, not just `immundata`:

```r
install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch))
```

More information is available on the pak website.
To install the latest release of `immundata`, simply run:

```r
pak::pkg_install("immunomind/immundata")
```

Note that this installs the package from our GitHub rather than from CRAN. This method is preferred due to CRAN's limitations and `immundata`'s reliance on other packages, which are distributed via `pak` as well.

We will periodically release `immundata` on CRAN. To install it from CRAN, run:

```r
pak::pkg_install("immundata")
```

If you would like to try unstable yet bleeding-edge features, or if there is a hot fix for your open GitHub ticket, please install the development version:

```r
pak::pkg_install("immunomind/immundata@dev")
```
> [!TIP]
> Interested in specific use cases, e.g., analysing cell clusters from single-cell data, working with paired-chain data, or searching for matches in databases with condition-associated TCR and BCR data? Take a look at the 🧩 Use Cases section below.
Use the immune repertoire data packaged with `immundata` for a quick dive. Replace the `system.file` calls with your local file paths to run the code on your data.
```r
library(immundata)

# Metadata table with additional sample-level information
md_path <- system.file("extdata/tsv", "metadata.tsv", package = "immundata")

# Two sample files
samples <- c(
  system.file("extdata/tsv", "sample_0_1k.tsv", package = "immundata"),
  system.file("extdata/tsv", "sample_1k_2k.tsv", package = "immundata")
)

# Read the metadata table
md <- read_metadata(md_path)

# Pass the file paths and the metadata table to the function to read the dataset into R
imdata <- read_repertoires(path = samples,
                           schema = c("cdr3_aa", "v_call"),
                           metadata = md,
                           output_folder = "./immundata-quick-start")

# Print the resulting object in a detailed yet manageable format
imdata

# Check the folder immundata created - this is where your dataset resides now
list.files("./immundata-quick-start")

# Read the sections below for data analysis
```
`immundata` splits the workflow into two clear phases:

1. Ingestion — convert your AIRR files into a special format saved on disk, and then read them into a tidy `immundata::ImmunData` object;
2. Transformation — explore, annotate, filter and compute on that object.

Before we go into more detail on each of these phases, there are three straightforward yet essential `immundata` concepts to keep in mind. These concepts set it apart from data-frame-based AIRR libraries. By extension, they affect how you work with — and even think about — data analysis in other packages, such as `immunarch`, that use `immundata` as a backbone for computations.
### Units: chain → barcode → receptor
| Term | In plain English | How immundata represents it | Role |
|---|---|---|---|
| Chain | A single V(D)J transcript (e.g. TRA or IGH) coming from one read or contig. | One row in the physical table `idata$annotations`; retains `locus`, `cdr3`, `umis`/`reads` and other crucial rearrangement characteristics. | Raw data unit — atomic building block. |
| Barcode / Cell | The droplet (10x), spot (Visium) or well a chain was captured in. | Column `imd_barcode`. | Physical bundle — groups chains that share a capture compartment. |
| Receptor | The biological receptor you analyse: a single chain or a paired set (αβ, heavy–light) from one cell. | Virtual table `idata$receptors`; unique ID `imd_receptor_id`. | Logical unit — minimal object for AIRR statistics. |
| Repertoire | A set of receptors grouped by sample, donor, cluster, etc. | Physical table `idata$repertoires`; unique ID `imd_repertoire_id`; grouping columns you choose. | Aggregate unit — higher-level grouping for comparative analysis. |
Chain is one V(D)J rearranged molecule / contig / chemistry read (e.g. a single TRA, TRB, IGH, IGL). This is the smallest possible data unit, a building block of everything. In the case of single-chain data, a chain is the same as a barcode. It never changes after ingest; you can always drill back to the exact sequence.

Barcode is a physical container that stores zero, one, or many chains. In single-cell data it's a droplet, i.e. a cell; in bulk it's the same as an assembled "clonotype"; in spatial it's a spot. A barcode sometimes equals a cell. It is a biological unit that "stores" relevant biological data and is used for aggregating identical chains and counting identical receptors coming from different barcodes. It inherits any per-cell / per-sample metadata you add.

Receptor is a logical unit: a grouping of chains that you want to treat as one biological receptor. This is the minimal unit for AIRR statistics. It has two components — receptor features and receptor chains — which altogether comprise a receptor schema that you define for downstream analysis. Receptor features are usually CDR3 amino acid sequences, or CDR3 amino acid sequences plus the Variable gene segment. Receptor chains can be: a single chain, an α+β pair, a heavy+light pair, or even all chains sharing the same CDR3/V/J.
To summarise: chains are how `immundata` stores the information, barcodes bundle chains together, and receptors are the minimal units on which repertoire statistics are computed.
### Aggregation: defining receptors and repertoires
The moment data leave an AIRR-assembly tool such as Cell Ranger, you are handed an ocean of individual V(D)J chains, yet every biological question you care about is phrased in terms of receptors ("this αβ TCR") or repertoires ("all receptors from donor A on day 30"). As with "receptors as logical units", the underlying assumption behind this second concept is that researchers work with rearrangements but think in receptors. `immundata` formalises the climb from raw chains to those higher-order concepts through controlled aggregation — explicit, user-defined rules that transform data without obscuring their origin.
The function `agg_receptors()` lets you declare what one receptor means in your study. You choose a schema — perhaps "pair chains that share a barcode and have complementary α and β loci" or "group every IGH with whatever IGL shares the same CDR3 amino-acid sequence". The function re-aggregates the data and returns a new `ImmunData` object, so the previous receptor definition stays intact; every receptor now has a stable identifier and can be traced back to its constituent chains and barcode. There is no need to touch the downstream pipeline — just change the input.
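As a minimal sketch — assuming `agg_receptors()` accepts an existing `ImmunData` object and a schema built with `make_receptor_schema()` (both shown later in this README) — redefining receptors as paired αβ TCRs could look like:

```r
# Hypothetical sketch: redefine receptors as paired alpha-beta TCRs.
# `idata` is an existing ImmunData object; the exact argument names
# are assumptions based on the rest of this README.
paired_schema <- make_receptor_schema(
  features = c("cdr3_aa", "v_call"),
  chains   = c("TRA", "TRB")
)

idata_paired <- agg_receptors(idata, schema = paired_schema)
```

The original `idata` is left untouched, so both receptor definitions can coexist within the same analysis.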
The function `agg_repertoires()` states how receptors should be bundled into biologically meaningful cohorts: all receptors from a biopsy, from a therapy responder, from a single-cell-defined cluster, or any combination of metadata columns. The result is a physical `idata$repertoires` table with basic statistics (numbers of chains, barcodes, and unique receptors), again preserving direct links to the receptors it aggregates.
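For instance — a sketch assuming the annotation columns `"Sample"` and `"Therapy"` exist, as in the example metadata shipped with the package — grouping receptors into repertoires per sample and therapy could look like:

```r
# Sketch: bundle receptors into repertoires defined by metadata columns.
# "Sample" and "Therapy" are illustrative column names.
idata <- idata |> agg_repertoires(schema = c("Sample", "Therapy"))

# The new repertoire table with pre-computed counts
idata$repertoires
```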
Because these aggregation steps live in your pipeline rather than being buried inside helper functions, they deliver two major pay-offs:

- Convenience with rigour: you can run high-level computations — Jaccard coefficients, diversity indices — knowing that the exact receptor definition is stored alongside the result, so you never mis-specify parameters such as `"cdr3+v"`;
- Provenance and data lineage by design: every receptor records every chain it contains, every repertoire records every receptor, and the full recipe is stored in the object's metadata. Six months later — or six reviewers later — you can trace any summary statistic back to the precise chains that produced it, enabling fully reproducible pipelines with no hidden transformations.
### Pipeline-based execution: immutability and materialisation
The explicit data lineage we discussed in concepts 1 and 2 pays its dividend only if every step is re-playable. That is why `immundata` treats an analysis as a pipeline of immutable transformations. Each function returns a fresh `ImmunData` object, leaving the parent untouched; the full chain of objects records how the data travelled from raw chains to final statistics.

Because immunomics datasets have started to regularly outgrow RAM, those objects do not live in memory by default. Their tables are persisted as column-compressed Parquet files and "materialised" — pulled into RAM — only when a computation truly needs them, typically to crunch a subset or to emit the final numbers such as repertoire-overlap indices. For a 10 GB dataset that fits in memory, this behaviour is invisible: DuckDB streams the file, you get an in-memory frame, and life goes on. For a 100 GB experiment, the same code still runs; the heavy joins spill to disk, and the intermediate results are cached so downstream steps can reuse them without recomputation.
Thinking in pipelines therefore means two things:

- Cache what matters: create intermediate `ImmunData` objects when you hit an expensive step, and write them to disk; the next run can pick up from there. A prime example of this is computing edit distances to some patterns or sequences.
- Assume re-execution: any colleague (or future-you on a bigger cluster) should be able to rerun `pipeline.R` end-to-end without interactive tinkering and arrive at the same result byte-for-byte.
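A minimal sketch of this style — assuming, as the ingestion section describes, that `read_repertoires()` persists its result to `output_folder`; the file and folder paths are illustrative:

```r
# Sketch of a re-runnable pipeline script (pipeline.R).
# Argument names follow the examples elsewhere in this README;
# paths are illustrative placeholders.
library(immundata)

# Expensive step: ingest once, cached on disk as Parquet
idata <- read_repertoires(
  path          = "data/samples/*.tsv",
  schema        = c("cdr3_aa", "v_call"),
  output_folder = "./cache/immundata-base"
)

# Every later transformation returns a fresh ImmunData, so re-running
# this script end-to-end reproduces each intermediate object.
```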
All of this engineering should stay behind the curtain. Downstream packages that adopt `immundata` as their backbone should expose high-level verbs such as `compute_diversity()` or `plot_overlap()`; the user need not touch `ImmunData`, DuckDB, or Parquet. In the ideal scenario they never learn that an on-disk database powers their workflow — and they never have to.

Leave the data engineering to the data engineers (and, sadly, bioinformaticians — I feel your pain); keep your focus, or the focus of your users, on the biology. It is sophisticated enough already.

And now, let's dive into how you work with `immundata`.
```
┌───────┐
│ files │
└───────┘
    │
    ▼
read_metadata() ──── Read metadata
    │
    ▼
read_repertoires() ──┬─ Read repertoire files (!)
    │                │      ▼
    │                │  Preprocess
    │                │      ▼
    │                │  Aggregate receptors (!)
    │                │      ▼
    │                │  Postprocess
    │                │      ▼
    │                │  Aggregate repertoires #1
    │                │      ▼
    │                └─ Write data on disk (!)
    ▼
agg_repertoires() ──── Aggregate repertoires #2
    │
    ▼
┌───────────┐
│ ImmunData │
└───────────┘
```
Steps marked with `(!)` are mandatory.
The goal of the ingestion phase is to turn a folder of AIRR-seq files into an immutable on-disk `ImmunData` dataset.

Read metadata: `read_metadata()` pulls in any sample- or donor-level information, such as therapy arm, HLA type, age, etc., and stores it in a data frame that you can pass to the main reading function `read_repertoires()`. Attaching this context early means every chain you read later already "knows" which patient or time-point it belongs to. You can safely skip this step if you don't have per-sample or per-donor metadata.

Read repertoire files: `read_repertoires()` streams Parquet/CSV/TSV files straight into DuckDB, which powers `ImmunData` objects.

Preprocess: during the read step you may pass a preprocessing recipe to `read_repertoires()` to transform the data before aggregating receptors: drop unused columns, strip non-productive sequences, translate field names to the AIRR schema, de-duplicate contigs, etc. Because this logic is declarative, re-runs produce identical results.

Aggregate receptors: the receptor schema is how you define a receptor — a logical unit of analysis. `read_repertoires()` collapses chains into receptors accordingly and assigns each a stable unique identifier.

Postprocess: a mirror step to preprocess — a convenient hook to run QC checks, add derived fields, attach reference-gene annotations, or compute per-chain quality metrics after the dataset is ready. You can pass any number of steps, which will be executed in sequential order.

Aggregate repertoires #1: if you already know how to group receptors into repertoires, perhaps by the `"Sample"` or `"Donor"` columns from the metadata, you can pass `repertoire_schema = c("Sample")` to `read_repertoires()`. Otherwise, skip this and define repertoires later (common in single-cell workflows where you need cluster labels first).

Write data on disk: `read_repertoires()` always persists what it just built: column-compressed Parquet parts plus human-readable metadata in JSON. From here on, downstream steps can reopen the dataset instantly without touching the raw AIRR files again.
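The steps above collapse into a single `read_repertoires()` call. A sketch, with argument values drawn from examples elsewhere in this README (the output folder name is illustrative):

```r
# Sketch: ingestion in one call, combining the steps above.
idata <- read_repertoires(
  path              = samples,                     # AIRR files
  metadata          = md,                          # from read_metadata()
  schema            = c("cdr3_aa", "v_call"),      # receptor definition
  repertoire_schema = c("Sample"),                 # optional: repertoires at read time
  preprocess        = make_default_preprocessing("10x"),
  output_folder     = "./immundata-ingest"
)
```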
```
┌───────────┐      ┌────────────────────────────┐
│ ImmunData │      │ AnnData / Seurat / TCRdist │
└───────────┘      │ seur@meta.data / adata.obs │
      │            └────────────────────────────┘
      │                  │
      └────────┬─────────┘
               │
               ▼
annotate_immundata() ──── Import external annotations to ImmunData
               │
               ▼
agg_repertoires() ──── Aggregate repertoires
               │
               ▼
filter_immundata() ──── Filter receptors or repertoires
               │
               ▼
mutate_immundata() ──── Create or modify columns, compute statistics
               │
               │     ┌────────────────┐
               ├────►│ save / plot #1 │
               │     └────────────────┘
               ▼
annotate_immundata() ──── Annotate ImmunData with the computed statistics
               │
               │     ┌────────────────┐
               ├────►│ save / plot #2 │
               │     └────────────────┘
               │
               ▼
┌────────────────────────┐
│ seur@meta.data[:] <- … │ ──── Export ImmunData annotations
│ adata.obs = …          │
└────────────────────────┘
```
Transformation is a loop of annotation → modification and computation → visualisation, always producing a new `ImmunData` while leaving the parent intact. That immutability is what turns every notebook into a reproducible pipeline.

Import external annotations to ImmunData: `annotate_immundata()` (or its thin wrappers `annotate_barcodes()` / `annotate_receptors()`) merges labels from Seurat/AnnData/TCRdist/anything that can be expressed as a keyed data frame into the main table, so each chain has a corresponding annotation.
Aggregate repertoires: now that the extra labels are present, you might regroup receptors, for example, by donor × cell-state.

Filter receptors or repertoires: `filter_immundata()` accepts tidyverse predicates on chains, receptors, or repertoires.
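A sketch of such a filter — the predicate columns (`locus`, `productive`) are illustrative annotation fields, and the exact argument layout of `filter_immundata()` is an assumption:

```r
# Hypothetical filter: keep productive TRB chains only.
# Column names are illustrative; adapt them to your dataset.
idata_trb <- filter_immundata(idata, locus == "TRB", productive == "True")
```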
Create or modify columns, compute statistics: in this step you compute statistics per repertoire or per receptor, using the input receptor features. There are several scenarios depending on what you are trying to achieve:

- use `immunarch` for the most common analysis functions. The package will automatically annotate both receptors/barcodes/chains (!) and repertoires (!!) where possible;
- simply mutate on the whole dataset using `dplyr` syntax, e.g. compute the edit distance to a specific pattern using `mutate_immundata()`;
- for more complex computations that require applying a function to values and are probably not supported by `duckplyr`, see the Advanced Topics section for more details.
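For the `dplyr`-style scenario, a sketch — `mutate_immundata()` is named in this README, but the exact signature and the derived `cdr3_len` column are assumptions for illustration:

```r
# Hypothetical mutate: add a CDR3 length column to every chain.
idata_len <- mutate_immundata(idata, cdr3_len = nchar(cdr3_aa))
```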
Save / plot #1: cache the `ImmunData`. Use `ggplot2` to visualise the statistics computed from the `ImmunData`.

Annotate ImmunData with the computed statistics: `annotate_immundata()` (again) joins the freshly minted statistics back to the canonical dataset.

Save / plot #2: save the `ImmunData` with the new annotations to disk. Plot the results of the analysis.

Export ImmunData annotations: write the annotated data back to the cell-level dataset (Seurat / AnnData) for subsequent analysis. Additionally, you can write the `ImmunData` itself to disk if needed.
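As a sketch of the export step — assuming the `ImmunData` barcodes match the Seurat cell names and that an annotation column `ident_label` exists (both are assumptions for illustration):

```r
# Hypothetical export: copy a per-cell annotation from ImmunData to Seurat.
# For paired-chain data, barcodes repeat across chains; match() takes the
# first occurrence, which is sufficient for per-cell labels.
ann <- as.data.frame(idata$annotations)
seur$ident_label <- ann$ident_label[match(colnames(seur), ann$imd_barcode)]
```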
`immundata` provides a flexible system for loading immune receptor repertoire files from different sources — CSV, TSV and Parquet files, possibly gzipped, with some optionality. The main function for this is `read_repertoires()`. Below are four ways to pass your file paths, and one for converting data from existing `immunarch` pre-1.0 list objects with `$data` and `$meta`.
Pass a single file name: if you just have one AIRR file:

```r
library(immundata)

inp_file <- system.file("extdata/tsv", "sample_0_1k.tsv", package = "immundata")

idata <- read_repertoires(
  path = inp_file,
  schema = c("cdr3_aa", "v_call")
)

print(idata)
```
Pass a vector of file names: for multiple files in a vector:

```r
library(immundata)

inp_file1 <- system.file("extdata/tsv", "sample_0_1k.tsv", package = "immundata")
inp_file2 <- system.file("extdata/tsv", "sample_1k_2k.tsv", package = "immundata")
file_vec <- c(inp_file1, inp_file2)

idata <- read_repertoires(
  path = file_vec,
  schema = c("cdr3_aa", "v_call")
)

print(idata)
```

`immundata` automatically merges them (depending on your chosen schema), writes the aggregated data into a single directory of Parquet files, and produces a single `ImmunData` object. Think of it as one huge table instead of multiple smaller repertoire tables.
Pass a glob pattern: if your files follow a consistent naming pattern, you can leverage shell globs:

```r
library(immundata)

folder_with_files <- system.file("extdata/tsv", "", package = "immundata")
glob_files <- paste0(folder_with_files, "sample*.tsv")
print(glob_files)
# The output is something like "/Library/Frameworks/.../immundata/extdata/tsv/*"
# Mind the star "*" at the end

# For example, all AIRR files in the 'samples/' folder
idata <- read_repertoires(
  path = glob_files,
  schema = c("cdr3_aa", "v_call")
)

print(idata)
```

Behind the scenes, `read_repertoires()` expands the glob with `Sys.glob(...)`, merges the data, and produces a single `ImmunData`.
Use a metadata file: sometimes you need more control over the data source (e.g. consistent sample naming, extra columns). In that case:

1. Load metadata with `read_metadata()`.
2. Pass the resulting data frame to `read_repertoires(path = "<metadata>", ..., metadata = md_table)`. Mind the `"<metadata>"` string we pass to the function. It indicates that the file paths should be taken from the input metadata table.

An example:

```r
library(immundata)

md_path <- system.file("extdata/tsv", "metadata.tsv", package = "immundata")
md_table <- read_metadata(md_path)

print(md_table)
# The column "File" stores the file paths. If you have a different column name
# for files, use the `metadata_file_col = "<your column name>"` argument.
#
# A tibble: 2 × 5
#   File                       Therapy Response Prefix filename
#   <chr>                      <chr>   <chr>    <chr>  <chr>
# 1 /.../immundata-/inst/extd… ICI     FR       S1_    /Users/…
# 2 /.../immundata-/inst/extd… CAR-T   PR       S2_    /Users/…

idata <- read_repertoires(
  path = "<metadata>",
  metadata = md_table,
  schema = c("cdr3_aa", "v_call")
)

print(idata)
```

This approach unifies sample-level metadata (e.g. donor ID, timepoint) with your repertoire data inside a single `ImmunData`.
You can also pass the metadata table separately along with the list of files, as we did in the previous examples, without the `"<metadata>"` string. For more information on how to work with metadata files, please read the next section.
Convert from `immunarch` lists: pass `immunarch` data lists to `from_immunarch()` to create `ImmunData` objects.
Metadata tables store the sample-level information. When `immundata` loads the metadata, it annotates every receptor from a given sample (or file) with the corresponding metadata fields. For example, if a sample has "Therapy" = "CAR-T", all receptors from that sample receive the same "Therapy" value. You can then aggregate receptors by donor, tissue, or any other field and run your analysis on those repertoires (see the next sections for aggregations).

> [!WARNING]
> In the current version, "metadata" and "repertoire schema" are the same, meaning you can't get a metadata field into `idata$repertoires` if you haven't defined repertoires using that field. I will implement this in the next versions; for now, please consider using `dplyr::left_join` to merge the metadata and the repertoires table together.
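A sketch of that workaround — the join key `"Tissue"` is illustrative, and assumes repertoires were defined on that column while the extra metadata fields live in `md_table`:

```r
# Hypothetical workaround: attach extra metadata fields to the
# repertoires table via a manual join.
reps_annotated <- dplyr::left_join(
  idata$repertoires,
  md_table,
  by = "Tissue"  # illustrative shared column
)
```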
```r
library(immundata)

md_path <- system.file("extdata/tsv", "metadata.tsv", package = "immundata")
md_table <- read_metadata(md_path)
#> Rows: 2 Columns: 4
#> ── Column specification ─────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): File, Therapy, Response, Prefix
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> ℹ Found 2/2 repertoire files from the metadata on the disk
#> ✔ Metadata parsed successfully

print(md_table)
#> # A tibble: 2 × 5
#>   File                       Therapy Response Prefix filename
#>   <chr>                      <chr>   <chr>    <chr>  <chr>
#> 1 /.../immundata-/inst/extd… ICI     FR       S1_    /Users/…
#> 2 /.../immundata-/inst/extd… CAR-T   PR       S2_    /Users/…
```
`immundata` lets you decide what a receptor means for your study by specifying:

- Feature columns — which fields make a receptor unique. Usually these are the `"cdr3_aa"` and `"v_call"` columns.
- Chains to keep / pair — e.g. TRA only, or the pair TRA + TRB.

If you only have feature columns, you can usually pass the character vector of columns to functions. In the more advanced case with multiple-chain data, `immundata` provides a helper function `make_receptor_schema()` for building schemas:
```r
schema <- make_receptor_schema(
  features = c("cdr3_aa", "v_call"),
  chains = c("TRA", "TRB")
)
```
**Chain-agnostic**

Used for bulk or pre-filtered immune repertoire data. There is no filtering by chain data such as TRA or TRB. Each unique combination of features in the schema vector is assigned a unique receptor identifier and counts as a receptor. In the example below, the receptor features are the "cdr3_aa" and "v_call" columns — the CDR3 amino acid sequence and the V gene segment, respectively.
```r
library(immundata)

inp_file <- system.file("extdata/tsv", "sample_0_1k.tsv", package = "immundata")
schema <- c("cdr3_aa", "v_call")

idata <- read_repertoires(
  path = inp_file,
  schema = schema
)

print(idata)
```
**Single-chain**

> [!NOTE]
> Please note that the single-chain option does not (!) remove multiple chains per cell — yet. In other words, you will get multiple receptors per barcode. The paired-chain option filters out chains which don't have the max number of reads or UMIs per barcode. So receptor numbers and sequences could differ significantly.

Used for paired-chain data such as single-cell data, to focus the analysis on immune receptors with a specific chain. The data is pre-filtered to leave only the data units with the specified chain.
```r
library(immundata)

inp_file <- system.file("extdata/single_cell", "lt6.csv.gz", package = "immundata")
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains = "TRA"
)

idata <- read_repertoires(
  path = inp_file,
  schema = schema,
  barcode_col = "barcode",
  locus_col = "locus",
  preprocess = make_default_preprocessing("10x")
)

print(idata)
```
**Paired-chain**

When you want full αβ (or heavy–light) receptors, `immundata` can pair two chains that originate from the same barcode and keep, for each locus, the chain with the highest UMI/read count. A single unique receptor identifier is then assigned to the pair. The data is pre-filtered to the loci in the target chains. Within each barcode × locus combination, the chain with the most UMIs or reads is selected. Barcodes lacking either chain are dropped from the receptor table.
```r
library(immundata)

inp_file <- system.file("extdata/single_cell", "lt6.csv.gz", package = "immundata")
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains = c("TRA", "TRB")
)

idata <- read_repertoires(
  path = inp_file,
  schema = schema,
  barcode_col = "barcode",  # required for pairing
  locus_col = "locus",      # column that says "TRA" / "TRB"
  umi_col = "umis",         # choose chain with max UMIs per locus
  preprocess = make_default_preprocessing("10x")
)

print(idata)
```
Cheat-sheet for arguments to `read_repertoires()`:

| Situation | `barcode_col` | `locus_col` | `umi_col` | `chains` |
|---|---|---|---|---|
| Bulk data, no locus filtering | no | no | no | omit / `NULL` |
| Analyse TRA only | yes¹ | yes | no | `"TRA"` |
| Pair TRA+TRB, pick best chain per cell | yes | yes | yes | `c("TRA", "TRB")` |

¹ If you pass barcodes, they're stored but used for counting only.
To compute repertoire-level statistics such as gene-segment usage, the Jaccard coefficient, or the incidence of public receptors, you first need to define a repertoire. In `immundata` a repertoire is simply a group of receptors that share one or more values from annotation columns.

Just like with receptors, you can pass a schema — a character vector of column names — to specify how receptors are grouped into repertoires.

For bulk data you usually rely on the metadata table. This is useful when you want to aggregate receptors from the same donor or tissue and then analyse them, or when you want to filter out non-responders and analyse the responders only.

> [!NOTE]
> Don't confuse grouping of immune repertoires with grouping in plots. When you define an immune repertoire, all the proportions are recomputed, and each receptor is assigned a unique repertoire identifier for faster computations. You create virtual "tables" with immune receptors, and you can work with them separately using filters or mutations, even though the data is still stored as one huge table with all receptors. When you plot data, you first compute statistics per defined immune repertoire, and then you group them. You can later plot or re-group the resulting statistics, but the order of operations matters.

The true power of regrouping repertoires opens up when you work with single-cell data.
```r
library(immundata)

inp_file <- system.file("extdata/single_cell", "lt6.csv.gz", package = "immundata")
md_file <- system.file("extdata/single_cell", "metadata.tsv", package = "immundata")
md_table <- read_metadata(md_file)

schema <- make_receptor_schema(features = c("cdr3", "v_call"), chains = c("TRA", "TRB"))

idata <- read_repertoires(
  path = inp_file,
  schema = schema,
  metadata = md_table,
  barcode_col = "barcode",  # required for pairing
  locus_col = "locus",      # column that says "TRA" / "TRB"
  umi_col = "umis",         # choose chain with max UMIs per locus
  preprocess = make_default_preprocessing("10x")
)

cells_file <- system.file("extdata/single_cell", "cells.tsv.gz", package = "immundata")
cells <- readr::read_tsv(cells_file)

# External cell annotations
print(cells)

# Annotate data with cell clusters
idata <- annotate_barcodes(
  idata = idata,
  annotations = cells[c("cell_id", "ident")],
  annot_col = "cell_id",
  keep_repertoires = FALSE
)

# Aggregate repertoires to make cluster-specific repertoires
idata <- idata |> agg_repertoires(schema = "ident")

# Take a look at the $repertoires table
print(idata)
```
> [!CAUTION]
> 🚧 Under construction. 🚧
Barcode prefix: provide a column named "Prefix" in the metadata so that `make_default_postprocessing()` can automatically add this prefix to barcodes, making them unique in your resulting dataset.
> [!CAUTION]
> 🚧 Under construction. 🚧
By default, `read_repertoires()` writes the created Parquet files into a directory named `immundata_<first file name>`. Consider passing `output_folder` to `read_repertoires()` if you want to specify the output path.
An `ImmunData` object is a self-describing container that holds your immune repertoire dataset. `ImmunData` objects have several slots accessible via `<object>$<slot>`:

- `ImmunData$receptors` — a virtual table created on demand from `$annotations`. One row per receptor, as defined by your `$schema_receptor`; guaranteed to have the stable key `imd_receptor_id`. This is the aggregated view into your dataset, meaning that the combinations of receptor feature fields (cdr3, v_call) are unique with respect to rows, i.e., each row is unique.
- `ImmunData$annotations` — the main table that holds all the data. One row per chain (or per cell barcode in the case of single-chain data). Holds every AIRR field (cdr3, v_call, umis, etc.) plus any metadata you imported (sample_id, tissue, distances to patterns).
- `ImmunData$repertoires` — a physical table produced by `agg_repertoires()`. Each row is a repertoire (sample, donor, cluster) and carries pre-computed counts: numbers of receptors, barcodes, and chains.
- `ImmunData$schema_receptor` — the recipe that says how to collapse chains into receptors: which features make them identical (e.g. cdr3_aa, v_call) and which loci must pair (e.g. α+β, heavy+light, single-chain).
- `ImmunData$schema_repertoire` — the grouping keys used to bundle receptors into repertoires (e.g. sample_id, donor_id, time_point, etc.). Lets you define multiple biological layers inside one dataset (e.g. patient level vs. patient × cluster level) and switch between them.

You should not — and cannot — change those slots directly. Being a self-describing container is a harsh life, full of dangers and unwanted adventures, and it requires constant re-evaluation of what goes there and why. `immundata` functions do that internally.

Every transformation returns a new `ImmunData`, so you can stash (not trash) intermediate versions on disk and reproduce any branch of the analysis.
Example:

```r
library(immundata)

inp_files <- paste0(system.file("extdata/single_cell", "", package = "immundata"), "/*.csv.gz")
md_file <- system.file("extdata/single_cell", "metadata.tsv", package = "immundata")
md_table <- read_metadata(md_file)

cells_file <- system.file("extdata/single_cell", "cells.tsv.gz", package = "immundata")
cells <- readr::read_tsv(cells_file)

schema <- make_receptor_schema(features = c("cdr3", "v_call"), chains = c("TRB"))

idata <- read_repertoires(
  path = inp_files,
  schema = schema,
  metadata = md_table,
  barcode_col = "barcode",
  locus_col = "locus",
  umi_col = "umis",
  preprocess = make_default_preprocessing("10x"),
  repertoire_schema = "Tissue"
)

print(idata)
```
The printed `ImmunData` object `idata`:
```
── ImmunData ──────────────────────────────────────────────────────────────────────────────────────────
── Receptors: ──
# A duckplyr data frame: 3 variables
   imd_receptor_id cdr3               v_call
             <int> <chr>              <chr>
 1            2514 CASSVHPQYF         TRBV2
 2            7687 CAWSGQGWGGSTDTQYF  TRBV30
 3            2515 CASSPRPGSTGELFF    TRBV18
 4            5111 CASSQGLGVSYEQYF    TRBV4-1
 5            7688 CASSHGQGRTGELFF    TRBV7-2
 6            7689 CASGLRRGRDSGANVLTF TRBV19
 7            2516 CSAHRGLGNQPQHF     TRBV20-1
 8            5112 CASSPQGVSNQPQHF    TRBV7-2
 9               1 CASSSVSGNSPLHF     TRBV7-9
10            5113 CASPLGALTDTQYF     TRBV2
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
── Annotations: ──
# A duckplyr data frame: 23 variables
   barcode   locus v_call d_call j_call c_gene productive cdr3  cdr3_nt reads  umis filename imd_barcode
   <chr>     <chr> <chr>  <chr>  <chr>  <chr>  <chr>      <chr> <chr>   <dbl> <dbl> <chr>    <chr>
 1 AAACCTGA… TRB   TRBV2  None   TRBJ2… TRBC2  True       CASS… TGTGCC… 13736    11 /Users/… LB6_AAACCT…
 2 AAACCTGC… TRB   TRBV30 TRBD1  TRBJ2… TRBC2  True       CAWS… TGTGCC…  4062     5 /Users/… LB6_AAACCT…
 3 AAACCTGC… TRB   TRBV18 TRBD2  TRBJ2… TRBC2  True       CASS… TGTGCC…  8617    10 /Users/… LB6_AAACCT…
 4 AAACCTGG… TRB   TRBV4… TRBD1  TRBJ2… TRBC2  True       CASS… TGCGCC…  6811     5 /Users/… LB6_AAACCT…
 5 AAACCTGG… TRB   TRBV7… TRBD1  TRBJ2… TRBC2  True       CASS… TGTGCC… 16836    15 /Users/… LB6_AAACCT…
 6 AAACCTGT… TRB   TRBV19 TRBD2  TRBJ2… TRBC2  True       CASG… TGTGCC…  8805     9 /Users/… LB6_AAACCT…
 7 AAACCTGT… TRB   TRBV2… TRBD1  TRBJ1… TRBC1  True       CSAH… TGCAGT…  8311     6 /Users/… LB6_AAACCT…
 8 AAACGGGA… TRB   TRBV7… TRBD1  TRBJ1… TRBC1  True       CASS… TGTGCC…  3390     3 /Users/… LB6_AAACGG…
 9 AAACGGGA… TRB   TRBV7… TRBD2  TRBJ1… TRBC1  True       CASS… TGTGCC…  4956     4 /Users/… LB6_AAACGG…
10 AAACGGGA… TRB   TRBV2  TRBD1  TRBJ2… TRBC2  True       CASP… TGTGCC…  5625     4 /Users/… LB6_AAACGG…
# ℹ more rows
# ℹ 10 more variables: imd_chain_id <int>, imd_receptor_id <int>, imd_n_chains <dbl>, File <chr>,
#   Tissue <chr>, Prefix <chr>, imd_count <dbl>, imd_repertoire_id <int>, imd_proportion <dbl>,
#   n_repertoires <int>
# ℹ Use `print(n = ...)` to see more rows
── Receptor schema: ──
features:
 • cdr3
 • v_call
chains:
 • TRB
── Repertoire schema: ──
 • Tissue
── List of repertoires: ──
# A tibble: 3 × 4
  imd_repertoire_id Tissue n_barcodes n_receptors
              <int> <chr>       <dbl>       <int>
1                 1 Blood        4085        3976
2                 2 Normal       6797        2950
3                 3 Tumor        7832        3962
```
Before running the code in the following subsections, execute the code below. Note that, for example purposes, the data uses the TRB locus only; change the input receptor schema and the column names to adapt it to the paired-chain case.
library(immundata)
inp_files <- paste0(system.file("extdata/single_cell", "", package = "immundata"), "/*.csv.gz")
md_file <- system.file("extdata/single_cell", "metadata.tsv", package = "immundata")
md_table <- read_metadata(md_file)
cells_file <- system.file("extdata/single_cell", "cells.tsv.gz", package = "immundata")
cells <- readr::read_tsv(cells_file)
schema <- make_receptor_schema(features = c("cdr3", "v_call"), chains = c("TRB"))
idata <- read_repertoires(
  path = inp_files,
  schema = schema,
  metadata = md_table,
  barcode_col = "barcode",
  locus_col = "locus",
  umi_col = "umis",
  preprocess = make_default_preprocessing("10x"),
  repertoire_schema = "Tissue"
)
The key functions for filtering are filter() (dplyr-compatible) and filter_immundata(), which are the same function with slightly different arguments, needed to comply with the dplyr interface. Repertoires are re-aggregated automatically. The functions filter_receptors() and filter_barcodes() filter by receptor and barcode identifiers, respectively.
To filter data, you simply pass predicates as in dplyr. Optionally, you can pass seq_options, which let you filter by exact sequence match, regex pattern, or sequence distance using Hamming or edit (Levenshtein) distances. You can pass multiple patterns via patterns = c("pattern_1", "pattern_2").
Filter by any annotation
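Since filter() is dplyr-compatible, any column from idata$annotations can appear in a predicate. A minimal sketch, using columns from the example dataset above:

```r
# Keep only productive TRBV2 chains; both columns come from the annotations
idata |> filter(v_call == "TRBV2", productive == "True")
```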
Chain filters together
Filter by sequence distance
idata |> filter(seq_options = make_seq_options(patterns = "CASSELAGYRGEQYF", query_col = "cdr3", method = "lev", max_dist = 3))
idata |> filter(v_call == "TRBV2", seq_options = make_seq_options(patterns = "CASSELAGYRGEQYF", query_col = "cdr3", method = "lev", max_dist = 3))
Filter by receptor identifiers
idata |> filter_receptors(c(1, 2, 3))
Filter by barcodes
target_bc <- cells$cell_id[1:3]
idata |> filter_barcodes(target_bc)
Filter by repertoire
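Repertoires here are defined by annotation columns (Tissue in the example above), so a plain filter() on those columns, or on the internal imd_repertoire_id column, restricts the data to specific repertoires. A sketch assuming the example dataset above:

```r
# By the column that defines repertoires...
idata |> filter(Tissue == "Tumor")

# ...or by the internal repertoire identifier
idata |> filter(imd_repertoire_id == 3)
```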
The key functions for annotation are annotate() and annotate_immundata(). The functions annotate_receptors(), annotate_barcodes() and annotate_chains() are lightweight wrappers around annotate_immundata().
Annotate by any column
Annotate by receptor identifiers
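A sketch of annotating by receptor identifiers. The argument names below are assumed to mirror annotate_barcodes(), and the label column is made up for illustration; check ?annotate_receptors for the actual signature:

```r
# Hypothetical annotation table keyed by the internal receptor identifier
rec_annot <- data.frame(
  imd_receptor_id = c(1, 2514, 7687),
  is_known_binder = c(TRUE, FALSE, TRUE)  # illustrative labels
)

idata2 <- annotate_receptors(
  idata = idata,
  annotations = rec_annot,
  annot_col = "imd_receptor_id"
)
```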
Annotate by barcodes
idata2 <- annotate_barcodes(
  idata = idata,
  annotations = cells[c("cell_id", "ident")],
  annot_col = "cell_id",
  keep_repertoires = FALSE
)
idata2 <- idata2 |> filter(!is.na(ident))
idata2 <- idata2 |> agg_repertoires(schema = "ident")
print(idata2)
The key functions here are mutate() (dplyr-compatible) / mutate_immundata() and the functions from the downstream analysis tools.
Add or transform one or several annotation columns
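For example, deriving new columns from existing ones; a minimal sketch assuming the example dataset above:

```r
# New columns are computed lazily by DuckDB, just like dplyr::mutate()
idata |> mutate(
  cdr3_len = nchar(cdr3),       # CDR3 amino acid length
  is_trbv2 = v_call == "TRBV2"  # logical flag from an existing column
)
```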
Add columns with sequence distance to patterns
patterns <- c("CASSVHPQYF", "CAWSGQGWGGSTDTQYF", "CASSPRPGSTGELFF")
idata |> mutate(seq_options = make_seq_options(query_col = "cdr3", patterns = patterns, method = "lev"))
idata |> mutate(seq_options = make_seq_options(query_col = "cdr3", patterns = patterns, method = "lev", name_type = "pattern"))
Modify a subset of column values
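A sketch using dplyr::if_else() inside mutate(), assuming the example dataset above; values are rewritten only where the condition holds:

```r
# Relabel one tissue, leaving all other rows untouched
idata |> mutate(Tissue = if_else(Tissue == "Normal", "Adjacent normal", Tissue))
```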
# Find receptors from several repertoires with abundance of at least 60
# and get their barcodes
idata2 <- idata |> filter(imd_count >= 60)
target_chains <- idata2$annotations |>
  select(imd_barcode, imd_count, cdr3, v_call, Tissue) |>
  collect()
target_chains
# You can then annotate those receptors in your single-cell data using
# the `imd_barcode` column as a key.
# Install the latest pre-1.0 version of immunarch
# pak::pkg_install("immunomind/immunarch")
library(immunarch)
ov_heatmap <- airr_public_index(idata, "jaccard")
pheatmap::pheatmap(ov_heatmap)
clonal_space_homeo <- airr_clonality(idata, "prop")
ggplot2::ggplot(data = clonal_space_homeo) +
  ggplot2::geom_col(ggplot2::aes(x = Tissue, y = occupied_prop, fill = clonal_prop_bin)) +
  ggplot2::theme_bw()
[!TIP] Tutorial on
immundata
+immunarch
is available onimmunarch
website.Read the previous section about the analysis.
This section is still under construction.
[!CAUTION] π§ Under construction. π§
Pretty much the whole README is either about single-chain or paired-chain data.
Consider passing make_default_preprocessing("airr")
or make_default_preprocessing("10x")
to read_repertoires(..., preprocess = <here>, ...)
to have convenient processing of the corresponding formats.
Pass count_col
to read_repertoires(..., count_col = "Counts", ...)
to assign counts to chains.
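Putting both hints together, a sketch for reading bulk AIRR-seq files; the input path and the "Counts" column name are assumptions for illustration:

```r
schema <- make_receptor_schema(features = c("cdr3", "v_call"), chains = c("TRB"))

idata_bulk <- read_repertoires(
  path = "/path/to/bulk/*.tsv",                     # hypothetical input files
  schema = schema,
  preprocess = make_default_preprocessing("airr"),  # AIRR-format preprocessing
  count_col = "Counts"                              # assign clonal counts to chains
)
```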
[!CAUTION] π§ Under construction. π§
Make sure to pass make_default_preprocessing("10x")
, locus_col
and barcode_col
to read paired-chain data. Drop the locus_col
if you want to read single-chain data only.
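A sketch of a paired-chain read, reusing the inputs from the single-chain example above; the only changes are the chains in the receptor schema and keeping locus_col:

```r
paired_schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains = c("TRA", "TRB")    # both chains instead of TRB only
)

idata_paired <- read_repertoires(
  path = inp_files,
  schema = paired_schema,
  metadata = md_table,
  barcode_col = "barcode",
  locus_col = "locus",        # needed to pair chains per barcode
  umi_col = "umis",
  preprocess = make_default_preprocessing("10x"),
  repertoire_schema = "Tissue"
)
```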
library(Seurat)
sdata <- ... # Load the Seurat data
idata <- ... # Load the corresponding AIRR data
# Get both GEX and metadata
smeta <- data.frame(
barcode = colnames(sdata),
cluster = Idents(sdata),
gene_MS4A1 = FetchData(sdata, "MS4A1")[,1]
)
idata <- annotate_barcodes(idata, smeta, annot_col = "barcode")
library(anndataR)
adata <- ... # Load the AnnData object
idata <- ... # Load the corresponding AIRR data
ameta <- data.frame(
barcode = adata$obs_names,
cluster = adata$obs$cell_type,
gene_GCG = adata$X[, "GCG"]
)
idata <- annotate_barcodes(idata, ameta, annot_col = "barcode")
[!CAUTION] π§ Under construction. π§
TCRdist: https://github.com/kmayerb/tcrdist3
Save receptors to a file and read it with TCRdist. If you need to rename columns, use dplyr.
receptors_tbl <- idata$receptors |> rename(cdr3_b_aa = cdr3) |> collect()
readr::write_csv(receptors_tbl, "receptors.csv")
# Read it via TCRdist
# Run TCRdist and get neighbours
# Write neighbours to disk
# Read neighbours back into R and annotate `idata` using `annotate()`
How immundata reads the data
By design, the immundata data-loading pipeline consists of three steps rather than one giant function. This promotes modularity, easier debugging, and flexible usage:
1. Load sample-level metadata with read_metadata().
2. Read and preprocess repertoire files with read_repertoires(). It writes annotations.parquet (cell-level data, sample metadata, etc.) to disk and then calls read_immundata() to return a fully instantiated ImmunData object that uses the newly created files on disk as a source. This helps two-fold: you don't lose your data, and it allows immundata to run optimized code when necessary.
3. Load previously saved ImmunData files later with read_immundata(path_to_immundata_folder), where the folder contains annotations.parquet in the immundata format.
Why split it up?
- Intermediate results are persisted as immundata files.
- Once in the immundata format, the data can be loaded in future sessions in constant time, without merging or parsing again.
format, you can load it in future sessions in constant time without merging or parsing again.[!CAUTION] π§ Under construction. π§
If the function is supported by duckdb, use dd$<function_name>
Use SQL
Run a completely custom function
You can change the RAM limits in duckplyr.
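For example, via duckplyr's db_exec(); the limit value here is just an illustration:

```r
# Cap DuckDB's memory usage for the current session
duckplyr::db_exec("SET memory_limit = '8GB'")
```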
Make immundata even faster with data engineering tricks
You can use hive partitioning to accelerate analysis. Recommended columns are locus, V gene segment, and sequence length.
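One way to repartition existing annotation data with the arrow package; the paths are assumptions, and a sequence-length column is derived first so it can serve as a partition key:

```r
library(arrow)
library(dplyr)

open_dataset("immundata/annotations.parquet") |>
  mutate(cdr3_len = nchar(cdr3)) |>
  write_dataset(
    "immundata_partitioned",
    format = "parquet",
    partitioning = c("locus", "cdr3_len")  # hive-style folder layout
  )
```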
Vadim I. Nazarov β main author and developer
Vasily Tsvetkov
β¦ more to come β¦
immundata is free for commercial usage as per the Apache-2.0 license. However, corporate users will not get prioritized support for immundata- or AIRR-related issues. The priority of the open-source tool immundata is open-source science.
If you are looking for prioritized support and setting up your data pipelines, consider contacting Vadim Nazarov for commercial consulting / support options / workshops and training sessions / designing data platforms and machine learning systems for multi-omics / or anything related.
Q: Why are all the function names and ImmunData fields so long? I want to write idata$rec instead of idata$receptors.
A: Two major reasons: improving code readability and encouraging the use of autocomplete tools. Please consider using Tab to trigger autocomplete; it accelerates things 10-20x.
Q: How does immundata work under the hood, in simpler terms?
A: Picture a three-layer sandwich:
Arrow
files on disk hold the raw tables in column-compressed Parquet.
DuckDB
is an in-process SQL engine that can query those files without loading them fully into RAM.
duckplyr
glues dplyr
verbs (filter, mutate, summarise, …) to DuckDB SQL, so your R code looks exactly like a tidyverse pipeline while the heavy lifting happens in C++.
When you call read_repertoires()
, immundata writes Arrow
parts, registers them with DuckDB
, and returns a duckplyr
table. Every later verb is lazily translated into SQL; nothing is materialised until a step truly needs physical data (e.g.Β a plot or an algorithm that exists only in R).
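You can see this laziness directly: building a pipeline is instant, and only collect() (or a plot) triggers the DuckDB query. A sketch assuming the example dataset above:

```r
# No data is read yet: this only builds a query plan over the Parquet files
lazy_counts <- idata$annotations |>
  filter(productive == "True") |>
  summarise(.by = Tissue, n_chains = n())

# The SQL runs here, and only the small result lands in RAM
collect(lazy_counts)
```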
Q: Why do you need to create Parquet files with receptors and annotations?
A: Those are intermediate files, optimized for future data operations, and working with them significantly accelerates immundata
. I will post a benchmark soon.
Q: Why does immundata
support only the AIRR standard?!
A: The short answer is because a single, stable schema beats a zoo of drifting ones.
The practical answer is that immundata
allows some optionality: you can provide column names for barcodes, etc.
The long answer is that the amount of investment required, not only for the development but also for the continued support of parsers for different formats, is astonishing. I developed parsers for 10+ formats for the tcR / immunarch packages. I would much prefer upstream tool developers not to change their formats each minor version, breaking pretty much all downstream pipelines and causing all sorts of pain to end users and tool developers (mind you, without bearing any responsibility to at least notify, let alone fix, the broken formats they introduced). The time of the Wild West is over. The AIRR community did an outstanding job creating its standard. Please urge the creators of your favourite tools or fellow developers to use this format or a superset, like immundata does.
immundata
does not and will not explicitly support other formats. This is both a practical stance and communication of crucial values, put into immundata
as part of a broader ecosystem of AIRR tools. The domain is already too complex, and we need to work together to make this complexity manageable. A healthy ecosystem is not the same as a complex ecosystem.
Q: Why is it so complex? Why do we need to use dplyr
instead of plain R?
A: The short answer is:
Q: How do I use dplyr
operations that duckplyr
doesn't support yet?
A: Letβs consider several use cases.
Case 0. You are missing group_by from dplyr. Use the .by argument instead: summarise(.by = <grouping columns>, ...).
Case 1. Your data can fit into RAM. Call collect
and use dplyr
.
Case 2. Your data won't fit into RAM, but you must run a heavy operation on all rows. You can rewrite functions in SQL, or break the work into supported pieces (e.g. pre-filter, pre-aggregate) that DuckDB can stream: write an intermediate Parquet with compute_parquet(), then loop over chunks, collect them, and run the analysis.
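A sketch of the streaming part, assuming the example dataset above; compute_parquet() comes from duckplyr and materialises the pre-filtered result on disk, where it can then be processed chunk by chunk:

```r
# Pre-filter with DuckDB streaming, then persist the smaller intermediate table
idata$annotations |>
  filter(productive == "True") |>
  compute_parquet("productive_chains.parquet")  # illustrative output path
```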
Case 3. Your data won't fit into RAM, but you are open to working with a smaller dataset first before running intensive computations. Filter down via slice_head(n = ...), iterate until the code works, then run the same pipeline on the full dataset.
Q: You filter out non-productive receptors. How do I explore them?
A: Do not filter out non-productive receptors in preprocess
in read_repertoires()
.
Q: Why does immundata have its own column names for receptors and repertoires? Couldn't you just use the AIRR format (repertoire_id, etc.)?
A: The power of immundata lies in fast re-aggregation of the data, which lets you work with whatever you define as a repertoire on the fly via agg_repertoires(). Hence immundata uses a superset of the AIRR format, which is totally acceptable as per their documentation.
Q: What do I do with the following error: "Error in compute_parquet() at […]: ! ?*
A: It means that your repertoire files have different schemas, i.e., different column names. You have two options.
Option 1: Check the data and fix the schema. Explore the reason why the data have different schemas. Remove wrong files. Change column names. And try again.
Option 2: If you know what you are doing, pass the argument enforce_schema = FALSE to read_repertoires(). The resulting table will have NAs in place of missing values. But don't use it without considering the first option: a broken schema usually means there are issues in how the data were processed upstream.
Q: immundata
is too verbose, Iβm tired of all the messages. How to turn them off?
A: Run options(rlib_message_verbosity = "quiet") at the beginning of your R session to turn off messages.
Q: I donβt want to use pak
, how can I use the good old install.packages
or devtools
?
A: Nothing will stop you, eh? You are welcome, but I'm not responsible if something doesn't work due to dependency issues:
# CRAN release
install.packages("immundata")
# GitHub release
install.packages(c("devtools", "pkgload"))
devtools::install_github("immunomind/immundata")
devtools::reload(pkgload::inst("immundata"))
# Development version
devtools::install_github("immunomind/immundata", ref = "dev")
devtools::reload(pkgload::inst("immundata"))
Q: Why are the counts for receptors available only after all the aggregation?
A: Counts and proportions are properties of a receptor inside a specific repertoire. A receptor seen in two samples will be counted twice, once per repertoire. Until receptors and repertoires are defined, any "count" would be ambiguous. That's why the numbers appear only after agg_receptors() and agg_repertoires() have locked those definitions in.