Read and process immune repertoire files into an ImmunData object

This is the main function for reading immune repertoire data into the immundata framework. It reads one or more repertoire files (AIRR TSV, 10X CSV, Parquet), performs optional preprocessing and column renaming, aggregates sequences into receptors based on a provided schema, optionally joins external metadata, performs optional postprocessing, and returns an ImmunData object.

The function handles different data types (bulk, single-cell) based on the presence of barcode_col and count_col. For efficiency with large datasets, it processes the data and saves intermediate results (annotations) as a Parquet file before loading them back into the final ImmunData object.

Usage

read_repertoires(
  path,
  schema,
  metadata = NULL,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL,
  preprocess = make_default_preprocessing(),
  postprocess = make_default_postprocessing(),
  rename_columns = imd_rename_cols("10x"),
  enforce_schema = TRUE,
  metadata_file_col = "File",
  output_folder = NULL,
  repertoire_schema = NULL
)

Arguments

path

Character vector. Path(s) to input repertoire files (e.g., "/path/to/data/*.tsv.gz"). Supports glob patterns via Sys.glob(). Files can be Parquet, CSV, TSV, or gzipped versions thereof. All files must be of the same type. Alternatively, pass the special string "<metadata>" to read file paths from the metadata table (see metadata and metadata_file_col params).
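
As a rough illustration of how a glob pattern expands into a file list, the base R sketch below creates a few temporary files and resolves a pattern with Sys.glob(). The file names are hypothetical, and the same-type check is only an assumption about what such validation might look like, not the package's own code.

```r
# Create a scratch directory with hypothetical file names
tmp_dir <- tempfile("glob_demo_")
dir.create(tmp_dir)
invisible(file.create(file.path(tmp_dir, c("s1.tsv.gz", "s2.tsv.gz", "notes.txt"))))

# Expand the glob pattern into concrete paths; "notes.txt" does not match
files <- Sys.glob(file.path(tmp_dir, "*.tsv.gz"))
print(basename(files))

# Assumed check: all inputs share one file type (everything after the first dot)
exts <- unique(sub("^[^.]*\\.", "", basename(files)))
stopifnot(length(exts) == 1)

unlink(tmp_dir, recursive = TRUE)
```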

schema

Defines how unique receptors are identified. Can be:

A character vector of column names (e.g., c("v_call", "j_call", "junction_aa")).
A schema object created by make_receptor_schema(), allowing specification of chains for pairing (e.g., make_receptor_schema(features = c("v_call", "junction_aa"), chains = c("TRA", "TRB"))).
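
To make the idea of a character-vector schema concrete, here is a minimal base R sketch of what identifying unique receptors by a set of columns means. It does not call agg_receptors() and is not how the package implements aggregation; the toy table and the receptor_id column name are made up for illustration.

```r
# Toy sequence table (hypothetical values)
seqs <- data.frame(
  v_call      = c("TRBV1", "TRBV1", "TRBV2", "TRBV1"),
  j_call      = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ2"),
  junction_aa = c("CASSLG", "CASSLG", "CASSDG", "CASSLG"),
  stringsAsFactors = FALSE
)

# A character-vector schema: columns that jointly define a receptor
schema <- c("v_call", "j_call", "junction_aa")

# Rows sharing the same values in all schema columns map to one receptor
key <- do.call(paste, c(seqs[schema], sep = "|"))
seqs$receptor_id <- match(key, unique(key))

print(seqs)
```

In this sketch the first two rows collapse into one receptor, while the fourth row stays separate because its j_call differs.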

metadata

Optional. A data frame of sample metadata to join with the repertoire data, typically loaded with read_metadata(). If path = "<metadata>", this table is required and must contain the file path column named by metadata_file_col. Default: NULL.

barcode_col

Character(1). Name of the column containing cell barcodes or other unique cell/clone identifiers for single-cell data. Triggers single-cell processing logic in agg_receptors(). Default: NULL.

count_col

Character(1). Name of the column containing UMI counts or frequency counts for bulk sequencing data. Triggers bulk processing logic in agg_receptors(). Mutually exclusive with barcode_col: specify one or the other, not both. Default: NULL.

locus_col

Character(1). Name of the column specifying the receptor chain locus (e.g., "TRA", "TRB", "IGH", "IGK", "IGL"). Required if schema specifies chains for pairing. Default: NULL.

umi_col

Character(1). Name of the column containing UMI counts for single-cell data. Used during paired-chain processing to select the most abundant chain per barcode per locus. Default: NULL.
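
The selection of the most abundant chain per barcode per locus can be pictured with the base R sketch below. It only illustrates the idea; the column names (barcode, locus, umis) and values are hypothetical, and how the package breaks ties is not specified here.

```r
# Toy single-cell contigs: barcode "c1" carries two TRB chains
contigs <- data.frame(
  barcode     = c("c1", "c1", "c1", "c2", "c2"),
  locus       = c("TRA", "TRB", "TRB", "TRA", "TRB"),
  junction_aa = c("CAVA", "CASSA", "CASSB", "CAVB", "CASSC"),
  umis        = c(5, 2, 9, 4, 7),
  stringsAsFactors = FALSE
)

# Sort so the highest UMI count comes first within each barcode/locus group
sorted <- contigs[order(contigs$barcode, contigs$locus, -contigs$umis), ]

# Keep the first (most abundant) row per barcode and locus
top_chains <- sorted[!duplicated(sorted[c("barcode", "locus")]), ]
rownames(top_chains) <- NULL

print(top_chains)
```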

preprocess

List. A named list of functions to apply sequentially to the raw data before receptor aggregation. Each function should accept a data frame (or duckplyr_df) as its first argument. See make_default_preprocessing() for examples. Default: make_default_preprocessing(). Set to NULL or list() to disable.
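
A minimal sketch of what "a named list of functions applied sequentially" could look like, using only base R. The step names and filters are hypothetical stand-ins, not the contents of make_default_preprocessing().

```r
# Hypothetical preprocessing steps; each takes and returns a data frame
steps <- list(
  keep_productive   = function(df) df[df$productive, , drop = FALSE],
  drop_missing_cdr3 = function(df) df[!is.na(df$junction_aa), , drop = FALSE]
)

raw <- data.frame(
  junction_aa = c("CASSL", NA, "CASSD"),
  productive  = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Apply the steps in list order, feeding each result into the next
cleaned <- Reduce(function(df, f) f(df), steps, raw)

print(cleaned)
```

A list built this way could be passed as preprocess, provided each function accepts the data as its first argument.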

postprocess

List. A named list of functions to apply sequentially to the annotation data after receptor aggregation and metadata joining. Each function should accept a data frame (or duckplyr_df) as its first argument. See make_default_postprocessing() for examples. Default: make_default_postprocessing(). Set to NULL or list() to disable.

rename_columns

Named character vector. Optional mapping to rename columns in the input files using dplyr::rename() syntax (e.g., c(new_name = "old_name", barcode = "cell_id")). Renaming happens before preprocessing and schema application. See imd_rename_cols() for presets. Default: imd_rename_cols("10x").
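
The direction of the mapping (names are the new column names, values are the old ones) is easy to get backwards, so here is a small base R sketch of the same convention. The column names are hypothetical, and silently skipping absent columns is a choice of this sketch rather than documented package behavior.

```r
# Mapping in dplyr::rename() style: names are new, values are old
rename_map <- c(barcode = "cell_id", junction_aa = "cdr3")

df <- data.frame(
  cell_id = c("c1", "c2"),
  cdr3    = c("CASSL", "CASSD"),
  reads   = c(10, 3),
  stringsAsFactors = FALSE
)

# Locate each old name in the table and overwrite it with the new name
hits <- match(rename_map, names(df))
names(df)[hits[!is.na(hits)]] <- names(rename_map)[!is.na(hits)]

print(names(df))
```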

enforce_schema

Logical(1). If TRUE, all input files must have exactly the same columns and column types. If FALSE, columns are unioned across files, which may be slower and use more memory. Default: TRUE.
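
What "unioning columns across files" means can be sketched in base R as follows. This is only a conceptual illustration with made-up tables and a hypothetical pad() helper; the package reads files through duckplyr and does not use this code.

```r
# Two toy tables with partially overlapping columns
a <- data.frame(v_call = "TRBV1", junction_aa = "CASSL", stringsAsFactors = FALSE)
b <- data.frame(v_call = "TRBV2", j_call = "TRBJ2", stringsAsFactors = FALSE)

all_cols <- union(names(a), names(b))

# Hypothetical helper: add NA for columns a table lacks, then align column order
pad <- function(df, cols) {
  for (col in setdiff(cols, names(df))) df[[col]] <- NA
  df[cols]
}

combined <- rbind(pad(a, all_cols), pad(b, all_cols))
print(combined)
```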

metadata_file_col

Character(1). The name of the column in the metadata table that contains the full paths to the repertoire files. Only used when path = "<metadata>". Default: "File".
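
The relationship between path = "<metadata>" and metadata_file_col can be sketched as below. The metadata values are hypothetical and the branching logic is an assumption about the general shape of the resolution step, not the package's implementation.

```r
# Hypothetical metadata table with a file path column
meta <- data.frame(
  SampleID = c("S1", "S2"),
  FilePath = c("/data/s1.tsv", "/data/s2.tsv"),
  stringsAsFactors = FALSE
)
path <- "<metadata>"
metadata_file_col <- "FilePath"

# Resolve the input file list: from the metadata column or from a glob
files <- if (identical(path, "<metadata>")) {
  stopifnot(!is.null(meta), metadata_file_col %in% names(meta))
  meta[[metadata_file_col]]
} else {
  Sys.glob(path)
}

print(files)
```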

output_folder

Character(1). Path to a directory where the processed annotation data will be saved as annotations.parquet, together with metadata.json. If NULL, a folder named immundata-<basename_without_ext> is created in the same directory as the first input file specified in path. The final ImmunData object reads from these saved files. Default: NULL.
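
For orientation, the default folder name could be derived roughly as in the sketch below. The input path is hypothetical, and stripping both the compression suffix and the format extension is an assumption of this sketch; the exact rule the package uses for <basename_without_ext> is not spelled out here.

```r
# Hypothetical first input file
first_file <- "/path/to/data/sample1.tsv.gz"

# Assumption: drop a compression suffix (.gz) and then the format extension
stem <- tools::file_path_sans_ext(basename(first_file), compression = TRUE)

# Place the folder next to the input file
default_dir <- file.path(dirname(first_file), paste0("immundata-", stem))
print(default_dir)
```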

repertoire_schema

Character vector or function. Defines the columns used to group annotations into distinct repertoires (e.g., by sample or donor). If provided, agg_repertoires() is called after loading to add repertoire-level summaries and metrics. Default: NULL.
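
Grouping annotations into repertoires by one or more columns can be pictured with the base R sketch below. It is not agg_repertoires(); the toy table and the summary column names (repertoire, n_rows, n_receptors) are hypothetical and will differ from what the package reports.

```r
# Toy annotation table after receptor aggregation and metadata joining
annotations <- data.frame(
  SampleID    = c("S1", "S1", "S2", "S2", "S2"),
  receptor_id = c(1, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)
repertoire_schema <- "SampleID"

# Split rows by the schema columns: one group per repertoire
groups <- split(annotations, annotations[repertoire_schema], drop = TRUE)

# Summarize each repertoire: row count and number of distinct receptors
summary_tbl <- data.frame(
  repertoire  = names(groups),
  n_rows      = vapply(groups, nrow, integer(1)),
  n_receptors = vapply(groups, function(g) length(unique(g$receptor_id)), integer(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(summary_tbl)
```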

Value

An ImmunData object containing the processed receptor annotations. If repertoire_schema was provided, the object will also contain repertoire definitions and summaries calculated by agg_repertoires().

Details

The function executes the following steps:

1. Validates inputs.
2. Determines the list of input files based on path and metadata, and checks file extensions.
3. Reads data using duckplyr (read_parquet_duckdb or read_csv_duckdb). Gzipped inputs (.gz) are supported.
4. Applies column renaming if rename_columns is provided.
5. Applies preprocessing steps sequentially if preprocess is provided.
6. Aggregates sequences into receptors using agg_receptors(), based on schema, barcode_col, count_col, locus_col, and umi_col. This creates the core annotation table.
7. Joins the metadata table if provided.
8. Applies postprocessing steps sequentially if postprocess is provided.
9. Creates a temporary ImmunData object in memory.
10. Determines the output_folder path.
11. Saves the processed annotation table and metadata to the output_folder using write_immundata().
12. Loads the data back from the saved Parquet files using read_immundata() to create the final ImmunData object. This ensures the returned object is backed by efficient on-disk storage.
13. If repertoire_schema is provided, calls agg_repertoires() on the loaded object to define and summarize repertoires.
14. Returns the final ImmunData object.

Examples

if (FALSE) { # \dontrun{
#
# Example 1: single-chain, one file
#
# Read a single AIRR TSV file, defining receptors by V/J/CDR3_aa
# "my_sample.tsv" stands in for a real AIRR-format file

# Create a dummy file for illustration
airr_data <- data.frame(
  sequence_id = paste0("seq", 1:5),
  v_call = c("TRBV1", "TRBV1", "TRBV2", "TRBV1", "TRBV3"),
  j_call = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1", "TRBJ1"),
  junction_aa = c("CASSL...", "CASSL...", "CASSD...", "CASSL...", "CASSF..."),
  productive = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  locus = c("TRB", "TRB", "TRB", "TRB", "TRB")
)
readr::write_tsv(airr_data, "my_sample.tsv")

# Define receptor schema
receptor_def <- c("v_call", "j_call", "junction_aa")

# Specify output folder
out_dir <- tempfile("immundata_output_")

# Read the data (disabling default preprocessing for this simple example)
idata <- read_repertoires(
  path = "my_sample.tsv",
  schema = receptor_def,
  output_folder = out_dir,
  preprocess = NULL, # Disable default productive filter for demo
  postprocess = NULL # Disable default barcode prefixing
)

print(idata)
print(idata$annotations)

#
# Example 2: single-chain, multiple files
#
# Read multiple files using metadata
# Create dummy files and metadata
readr::write_tsv(airr_data[1:2, ], "sample1.tsv")
readr::write_tsv(airr_data[3:5, ], "sample2.tsv")
meta <- data.frame(
  SampleID = c("S1", "S2"),
  Tissue = c("PBMC", "Tumor"),
  FilePath = c(normalizePath("sample1.tsv"), normalizePath("sample2.tsv"))
)
readr::write_tsv(meta, "metadata.tsv")

# Keep the output path in a variable so it can be removed during cleanup
out_dir_multi <- tempfile("immundata_multi_")

idata_multi <- read_repertoires(
  path = "<metadata>",
  metadata = meta,
  metadata_file_col = "FilePath",
  schema = receptor_def,
  repertoire_schema = "SampleID", # Aggregate by SampleID
  output_folder = out_dir_multi,
  preprocess = make_default_preprocessing("airr"), # Use default AIRR filters
  postprocess = NULL
)

print(idata_multi)
print(idata_multi$repertoires) # Check repertoire summary

# Clean up dummy files and output folders
file.remove("my_sample.tsv", "sample1.tsv", "sample2.tsv", "metadata.tsv")
unlink(out_dir, recursive = TRUE)
unlink(out_dir_multi, recursive = TRUE)
} # }