Read and process immune repertoire files to immundata
Source: R/io_repertoires_read.R
read_repertoires.Rd
This is the main function for reading immune repertoire data into the immundata framework. It reads one or more repertoire files (AIRR TSV, 10X CSV, Parquet), performs optional column renaming and preprocessing, aggregates sequences into receptors based on a provided schema, optionally joins external metadata, performs optional postprocessing, and returns an ImmunData object.
The function handles different data types (bulk, single-cell) based on the presence of barcode_col and count_col. For efficiency with large datasets, it processes the data and saves intermediate results (annotations) as a Parquet file before loading them back into the final ImmunData object.
Usage
read_repertoires(
  path,
  schema,
  metadata = NULL,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL,
  preprocess = make_default_preprocessing(),
  postprocess = make_default_postprocessing(),
  rename_columns = imd_rename_cols("10x"),
  enforce_schema = TRUE,
  metadata_file_col = "File",
  output_folder = NULL,
  repertoire_schema = NULL
)
Arguments
- path
Character vector. Path(s) to input repertoire files (e.g., "/path/to/data/*.tsv.gz"). Supports glob patterns via Sys.glob(). Files can be Parquet, CSV, TSV, or gzipped versions thereof. All files must be of the same type. Alternatively, pass the special string "<metadata>" to read file paths from the metadata table (see the metadata and metadata_file_col parameters).
- schema
Defines how unique receptors are identified. Can be:
  - A character vector of column names (e.g., c("v_call", "j_call", "junction_aa")).
  - A schema object created by make_receptor_schema(), allowing specification of chains for pairing (e.g., make_receptor_schema(features = c("v_call", "junction_aa"), chains = c("TRA", "TRB"))).
- metadata
Optional. A data frame containing metadata to be joined with the repertoire data, as read by the read_metadata() function. If path = "<metadata>", this table must be provided and must contain the file paths column specified by metadata_file_col. Default: NULL.
- barcode_col
Character(1). Name of the column containing cell barcodes or other unique cell/clone identifiers for single-cell data. Triggers single-cell processing logic in agg_receptors(). Default: NULL.
- count_col
Character(1). Name of the column containing UMI counts or frequency counts for bulk sequencing data. Triggers bulk processing logic in agg_receptors(). Default: NULL. Cannot be specified if barcode_col is also specified.
- locus_col
Character(1). Name of the column specifying the receptor chain locus (e.g., "TRA", "TRB", "IGH", "IGK", "IGL"). Required if schema specifies chains for pairing. Default: NULL.
- umi_col
Character(1). Name of the column containing UMI counts for single-cell data. Used during paired-chain processing to select the most abundant chain per barcode per locus. Default: NULL.
- preprocess
List. A named list of functions to apply sequentially to the raw data before receptor aggregation. Each function should accept a data frame (or duckplyr_df) as its first argument. See make_default_preprocessing() for examples. Default: make_default_preprocessing(). Set to NULL or list() to disable.
- postprocess
List. A named list of functions to apply sequentially to the annotation data after receptor aggregation and metadata joining. Each function should accept a data frame (or duckplyr_df) as its first argument. See make_default_postprocessing() for examples. Default: make_default_postprocessing(). Set to NULL or list() to disable.
- rename_columns
Named character vector. Optional mapping to rename columns in the input files using dplyr::rename() syntax (e.g., c(new_name = "old_name", barcode = "cell_id")). Renaming happens before preprocessing and schema application. See imd_rename_cols() for presets. Default: imd_rename_cols("10x").
- enforce_schema
Logical(1). If TRUE, reading multiple files requires them to have the exact same columns and types. If FALSE, columns are unioned across files (potentially slower, requires more memory). Default: TRUE.
- metadata_file_col
Character(1). The name of the column in the metadata table that contains the full paths to the repertoire files. Only used when path = "<metadata>". Default: "File".
- output_folder
Character(1). Path to a directory where intermediate processed annotation data will be saved as annotations.parquet and metadata.json. If NULL (default), a folder named immundata-<basename_without_ext> is created in the same directory as the first input file specified in path. The final ImmunData object reads from these saved files. Default: NULL.
- repertoire_schema
Character vector or Function. Defines columns used to group annotations into distinct repertoires (e.g., by sample or donor). If provided, agg_repertoires() is called after loading to add repertoire-level summaries and metrics. Default: NULL.
Value
An ImmunData object containing the processed receptor annotations. If repertoire_schema was provided, the object will also contain repertoire definitions and summaries calculated by agg_repertoires().
Details
The function executes the following steps:
1. Validates inputs.
2. Determines the list of input files based on path and metadata. Checks file extensions.
3. Reads data using duckplyr (read_parquet_duckdb or read_csv_duckdb). Handles .gz.
4. Applies column renaming if rename_columns is provided.
5. Applies preprocessing steps sequentially if preprocess is provided.
6. Aggregates sequences into receptors using agg_receptors(), based on schema, barcode_col, count_col, locus_col, and umi_col. This creates the core annotation table.
7. Joins the metadata table if provided.
8. Applies postprocessing steps sequentially if postprocess is provided.
9. Creates a temporary ImmunData object in memory.
10. Determines the output_folder path.
11. Saves the processed annotation table and metadata using write_immundata() to the output_folder.
12. Loads the data back from the saved Parquet files using read_immundata() to create the final ImmunData object. This ensures the returned object is backed by efficient storage.
13. If repertoire_schema is provided, calls agg_repertoires() on the loaded object to define and summarize repertoires.
14. Returns the final ImmunData object.
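Two of the steps above, resolving input files from glob patterns and deriving the default output folder, can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: resolve_input_files() and default_output_folder() are hypothetical helpers, and the way a compression suffix such as .gz is stripped before the extension is a guess at what <basename_without_ext> means.

```r
# Hypothetical helpers (not immundata functions).

# Expand glob patterns and require a single, supported file type.
resolve_input_files <- function(path) {
  files <- Sys.glob(path)
  if (length(files) == 0) {
    stop("No files matched: ", paste(path, collapse = ", "))
  }
  # Assumption: ignore a trailing .gz when determining the file type.
  ext <- tolower(tools::file_ext(sub("\\.gz$", "", files, ignore.case = TRUE)))
  if (!all(ext %in% c("parquet", "csv", "tsv"))) {
    stop("Unsupported file extension found.")
  }
  if (length(unique(ext)) != 1) {
    stop("All files must be of the same type.")
  }
  files
}

# Build "immundata-<basename_without_ext>" next to the first input file.
default_output_folder <- function(files) {
  first <- files[1]
  # compression = TRUE drops .gz/.bz2/.xz before removing the extension.
  stem <- tools::file_path_sans_ext(basename(first), compression = TRUE)
  file.path(dirname(first), paste0("immundata-", stem))
}

# Illustration with empty placeholder files in a temporary directory.
tmp <- tempfile("repertoires_")
dir.create(tmp)
file.create(file.path(tmp, c("sample1.tsv", "sample2.tsv")))

files <- resolve_input_files(file.path(tmp, "*.tsv"))
print(basename(files))
print(default_output_folder(files))

unlink(tmp, recursive = TRUE)
```

Passing output_folder explicitly, as the examples in this page do, avoids relying on any particular naming rule.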
Examples
if (FALSE) { # \dontrun{
  #
  # Example 1: single-chain, one file
  #
  # Read a single AIRR TSV file, defining receptors by V/J/CDR3_aa
  # Assume "my_sample.tsv" exists and follows AIRR format
  # Create a dummy file for illustration
  airr_data <- data.frame(
    sequence_id = paste0("seq", 1:5),
    v_call = c("TRBV1", "TRBV1", "TRBV2", "TRBV1", "TRBV3"),
    j_call = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1", "TRBJ1"),
    junction_aa = c("CASSL...", "CASSL...", "CASSD...", "CASSL...", "CASSF..."),
    productive = c(TRUE, TRUE, TRUE, FALSE, TRUE),
    locus = c("TRB", "TRB", "TRB", "TRB", "TRB")
  )
  readr::write_tsv(airr_data, "my_sample.tsv")

  # Define receptor schema
  receptor_def <- c("v_call", "j_call", "junction_aa")

  # Specify output folder
  out_dir <- tempfile("immundata_output_")

  # Read the data (disabling default preprocessing for this simple example)
  idata <- read_repertoires(
    path = "my_sample.tsv",
    schema = receptor_def,
    output_folder = out_dir,
    preprocess = NULL, # Disable default productive filter for demo
    postprocess = NULL # Disable default barcode prefixing
  )
  print(idata)
  print(idata$annotations)
  #
  # Example 2: single-chain, multiple files
  #
  # Read multiple files using metadata
  # Create dummy files and metadata
  readr::write_tsv(airr_data[1:2, ], "sample1.tsv")
  readr::write_tsv(airr_data[3:5, ], "sample2.tsv")
  meta <- data.frame(
    SampleID = c("S1", "S2"),
    Tissue = c("PBMC", "Tumor"),
    FilePath = c(normalizePath("sample1.tsv"), normalizePath("sample2.tsv"))
  )
  readr::write_tsv(meta, "metadata.tsv")

  # Keep the output path in a variable so it can be removed afterwards
  out_dir_multi <- tempfile("immundata_multi_")

  idata_multi <- read_repertoires(
    path = "<metadata>",
    metadata = meta,
    metadata_file_col = "FilePath",
    schema = receptor_def,
    repertoire_schema = "SampleID", # Aggregate by SampleID
    output_folder = out_dir_multi,
    preprocess = make_default_preprocessing("airr"), # Use default AIRR filters
    postprocess = NULL
  )
  print(idata_multi)
  print(idata_multi$repertoires) # Check repertoire summary

  # Clean up dummy files and output folders
  file.remove("my_sample.tsv", "sample1.tsv", "sample2.tsv", "metadata.tsv")
  unlink(out_dir, recursive = TRUE)
  unlink(out_dir_multi, recursive = TRUE)
} # }