F.A.Q.
Performance – speed, memory
-
Q: My workflows are slow! But you promised the new
immunarchandimmundatawould be faster!A: There are a few cases where
immunarch 1.0(powered byimmundata) may be slower thanimmunarch 0.9or other tools.-
If your dataset is small, the overhead of managing DuckDB can be larger than the computation itself.
-
Remember immutability? If you don't create snapshots to cache results, the entire pipeline (including file reads) will run again. That may be why it feels slower. Create snapshots with
immundata::write_immundata()after heavy steps. For example, when you're done annotating with scRNA-seq clusters or other metadata, save the updatedImmunDatato disk so those steps don't rerun.
You can run
duckplyr::explain(idata$annotations)to see the query plan. If it's long and hard to scroll, that's a good sign you should snapshot.It can feel inconvenient at first, but it gets easier. If anything is unclear, open an issue, and I'll add tutorials and best-practice guides. The benefits of this
dplyr-like, immutable workflow are worth it. -
Data – reading, writing, formats, engineering
-
Q: Why do we need
ImmunDatawhen there areAnnDataandMuData?A:
ImmunDatadoesn’t try to replace AnnData or MuData. It solves a different problem. AnnData/MuData are great for cell × gene data.ImmunDatais built for AIRR data-things like receptors, clonotypes, chains, and repertoires.-
Receptor-first and AIRR-aligned.
ImmunDatafollows the AIRR Community schema (with a small superset). Field names are stable and clear, so you can share data and methods across tools without guessing what a column means. -
Aggregation is a core feature. You can define a repertoire on the fly (by sample, tissue, timepoint, therapy, etc.) and re-aggregate quickly. Counts and proportions then become well defined per repertoire, which is essential for clonality, diversity, and publicity.
-
Works for bulk and single-cell. It handles bulk repertoires and scAIRR, supports multiple chains per barcode, pairing rules, and explicit clonotype definitions. When you need to show results in Seurat or AnnData, you simply join by
barcodeand, for example, colour a UMAP with receptor labels. -
Modern data engineering by default. Data live in Parquet/Arrow, queries run through DuckDB, and we use lazy evaluation. This means the pipeline scales beyond RAM and only materialises when needed. With snapshots (immutability) you can freeze expensive steps for speed and reproducibility.
-
Ready for ML/DL. It’s straightforward to build feature tables (V/J usage, k-mers, CDR3 embeddings, etc.) and export them as Parquet for training in Python, Julia, or R—no messy conversions.
-
Interoperable, not overlapping. Keep gene-expression matrices in Seurat/AnnData. Keep AIRR receptors in
ImmunData. Join by barcode when you want the two worlds together. Each tool does what it’s best at. -
Focused scope. Heavy or niche analyses can live in small extension packages. The core stays lean, stable, and easier to install.
When not to use it: For very small datasets (where Arrow/DuckDB overhead may dominate) or for workflows that never touch AIRR concepts (pure transcriptomics).
In short: AnnData/MuData manage cells and genes.
ImmunDatamanages receptors and repertoires. They work best together. -
-
Q: How does
immundataworks under the hood, in simpler terms?A: Picture a three-layer sandwich:
-
Parquetfiles inArrowcolumn-compressed format hold the raw tables on the disk. -
DuckDBis an in-process SQL engine that can query those files without loading them fully into RAM. -
duckplyrgluesdplyrverbs (filter, mutate, summarise, …) to DuckDB SQL, so your R code looks exactly like a tidyverse pipeline while the heavy lifting happens in C++.
When you call
read_repertoires(),immundatareadsParquetfiles, registers them withDuckDB, and returns aduckplyrtable – a data frame-like structure that is in reality a lightweight database connection toDuckdDB. Every later operation is lazily translated into SQL - a special language for quering databases. Nothing is materialised (i.e., computed and output) until a workflow step truly needs physical data (e.g. a plot or an algorithm that exists only in R).References:
-
-
Q: Why do you need to create Parquet files with receptors and annotations?
A: Those are intermediate files, optimized for future data operations, and working with them significantly accelerates
immundata. -
Q: Why does
immundatasupport only the AIRR-C file format?!A: The short answer is because a single, stable schema beats a zoo of drifting ones.
The practical answer is that
immundataallows some optionality – you can provide column names for barcodes, etc.The long answer is that the amount of investments required not only for the development, but also for the continued support of parsers for different formats, is astonishing. I developed parsers for 10+ formats for
tcR/immunarchpackages. I would much prefer for upstream tool developers not to change their format each minor version, breaking pretty much all downstream pipelines and causing all sorts of pain to end users and tools developers – mind you, without bearing a responsibility to at least notify, but ideally fix the broken formats they introduced. The time of the Wild West is over. The AIRR community did an outstanding job creating its standard. Please urge the creators of your favourite tools or fellow developers to use this format or a superset, likeimmundatadoes.immundatadoes not and will not explicitly support other formats. This is both a practical stance and communication of crucial values, put intoimmundataas part of a broader ecosystem of AIRR tools. The domain is already too complex, and we need to work together to make this complexity manageable. A healthy ecosystem is not the same as a complex ecosystem. -
Q: What do I do then? I really need the output from package
Xinimmunarch...A: In the previous question, I outlined the motivation behind focusing on AIRR-C format only.
There are four practical steps that you can do:
-
(recommended) Ask the maintainer of the package
Xto support the AIRR-C file format. It is not that hard to implement. -
Use
read_repertoiresfromimmundataand create your own reader. Helpful links:read_repertoiresreference and the full documentation website. -
Use the community-driven package
work-in-progressto read files as it provides more parsers. Link: TBD -
If
work-in-progressdoesn't support the desired file format, you can create your own parser usingread_repertoiresand contribute it towork-in-progress. I personally would greatly appreciate it.
-
-
Q: Why are the counts for receptors available only after all the aggregation?
A: Counts and proportions are properties of a receptor inside a specific repertoire. A receptor seen in two samples will be counted twice – once per repertoire. Until receptors and repertoires are defined, any "count" would be ambiguous. That's why the numbers appear only after
agg_receptors()andagg_repertoires()have locked those definitions in. -
Q: You filter out non-productive receptors. How do I explore them?
A: Disable filtering of non-productive receptors in the
preprocessstep ofread_repertoires(). -
Q: Why does
immundatahave its own column names for receptors and repertoires? Could you just use the AIRR format - repertoire_id etc.?A: The power of
immundatais fast re-aggregation, which lets you define what repertoire means on the fly viaagg_repertoires(). That's why we use a superset of AIRR field names, which is totally acceptable as per their documentation. -
Q: What do I do with following error: "Error in
compute_parquet()at [...]: ! ?*A: It means that your repertoire files have different schemas, i.e., different column names. You have two options.
Option 1: Check the data and fix the schema. Explore the reason why the data have different schemas. Remove wrong files. Change column names. And try again.
Option 2: If you know what you are doing, pass argument
enforce_schema = FALSEtoread_repertoires. The resultant table will have NAs in the place of missing values. But don't use it without considering the first option. Broken schema usually means that there are some issues in how the data were processed upstream.
Interface – functions, methods, naming
-
Q: Why all the function names or ImmunData fields are so long? I want to write
idata$recinstead ofidata$receptors.A: Two main reasons: readability and encouraging autocomplete. Use your IDE's Tab to trigger autocomplete; it can speed you up 10–20×.
-
Q: Why is it so complex? Why do we need to use
dplyrinstead of plain R?
A: The short answer is: - faster computations; - code that is easy to maintain and support by other humans; - better data skills by thinking in immutable transformations; - in most cases you don't need complex transformations, so we can optimise ~95% of AIRR operations behind the scenes.
-
Q: How do I use
dplyroperations thatduckplyrdoesn't support yet?A: Let's consider several use cases.
Case 0. You are missing
group_byfromdplyr. Usesummarise(.by = ???, ...).Case 1. Your data can fit into RAM. Call
collectand usedplyr.Case 2. Your data won't fit into RAM but you must run a heavy operation on all rows. You can rewrite functions in SQL. You can break it into supported pieces (e.g. pre-filter, pre-aggregate) that DuckDB can stream, write an intermediate Parquet with
compute_parquet(), then loop over chunks, collect them, and run the analysis.Case 3. Your data won't fit into RAM, but before running intensive computations, you are open to working with smaller dataset first. Filter down via
slice_head(n=...), iterate until the code works, then run the same pipeline on the full dataset.
Miscellaneous
-
Q:
immundatais too verbose, I'm tired of all the messages. How to turn them off?A: Run the following code
options(rlib_message_verbosity = "quiet")in the beginning of your R session to turn off messages. -
Q: I don't want to use
pak, how can I use the good oldinstall.packagesordevtools?A: Nothing will stop you, eh? You are welcome, but I'm not responsible if something won't work due to issues with dependencies:
```r # CRAN release install.packages("immundata") library(immundata)
# GitHub release install.packages("devtools") devtools::install_github("immunomind/immundata") library(immundata)
# Development version (dev branch) devtools::install_github("immunomind/immundata", ref = "dev") library(immundata) ```