R/io_repertoires_read.R
read_repertoires.Rd
This is the main function for reading immune repertoire data into the immundata framework. It reads one or more repertoire files (AIRR TSV, 10X CSV, Parquet), performs optional preprocessing and column renaming, aggregates sequences into receptors based on a provided schema, optionally joins external metadata, performs optional postprocessing, and returns an ImmunData object.
The function handles different data types (bulk, single-cell) based on the presence of barcode_col and count_col. For efficiency with large datasets, it processes the data and saves intermediate results (annotations) as a Parquet file before loading them back into the final ImmunData object.
read_repertoires(
  path,
  schema,
  metadata = NULL,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL,
  preprocess = make_default_preprocessing(),
  postprocess = make_default_postprocessing(),
  rename_columns = imd_rename_cols("10x"),
  enforce_schema = TRUE,
  metadata_file_col = "File",
  output_folder = NULL,
  repertoire_schema = NULL
)
path
Character vector. Path(s) to input repertoire files (e.g., "/path/to/data/*.tsv.gz"). Supports glob patterns via Sys.glob(). Files can be Parquet, CSV, TSV, or gzipped versions thereof. All files must be of the same type. Alternatively, pass the special string "<metadata>" to read file paths from the metadata table (see the metadata and metadata_file_col parameters).
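Before passing a glob pattern as path, it can help to check which files the pattern actually matches. The following is a minimal sketch in base R; the temporary directory and file names are made up for illustration, and it assumes the function expands patterns the same way a direct Sys.glob() call does.

```r
# Create a throwaway directory with three hypothetical files
tmp_dir <- tempfile("glob_demo_")
dir.create(tmp_dir)
invisible(file.create(file.path(tmp_dir, c("s1.tsv", "s2.tsv", "notes.txt"))))

# Expand the pattern with base R's Sys.glob(); only the .tsv files match
matched <- Sys.glob(file.path(tmp_dir, "*.tsv"))
print(basename(matched))

# Remove the throwaway directory
unlink(tmp_dir, recursive = TRUE)
```

If the printed list contains files of mixed types, the pattern should be narrowed, since all input files must be of the same type.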
schema
Defines how unique receptors are identified. Can be:
- A character vector of column names (e.g., c("v_call", "j_call", "junction_aa")).
- A schema object created by make_receptor_schema(), allowing specification of chains for pairing (e.g., make_receptor_schema(features = c("v_call", "junction_aa"), chains = c("TRA", "TRB"))).
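To make the idea of a column-based schema concrete, the sketch below counts the distinct combinations of the schema columns in a toy table. It uses only base R and is an illustration of the concept, not the logic inside agg_receptors(), which may differ (for example in how counts or chains are handled).

```r
# Toy sequence table (same values as the example data on this page)
seqs <- data.frame(
  v_call      = c("TRBV1", "TRBV1", "TRBV2", "TRBV1", "TRBV3"),
  j_call      = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1", "TRBJ1"),
  junction_aa = c("CASSL...", "CASSL...", "CASSD...", "CASSL...", "CASSF..."),
  stringsAsFactors = FALSE
)

# A character-vector schema: these columns together identify a receptor
schema_cols <- c("v_call", "j_call", "junction_aa")

# Rows that agree on every schema column collapse into one receptor
receptors <- unique(seqs[, schema_cols])
print(nrow(receptors))
```

Rows 1, 2, and 4 share all three values, so the five sequences reduce to three distinct receptors under this schema.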
metadata
Optional. A data frame containing metadata to be joined with the repertoire data, as read by the read_metadata() function. If path = "<metadata>", this table must be provided and must contain the file path column specified by metadata_file_col. Default: NULL.
barcode_col
Character(1). Name of the column containing cell barcodes or other unique cell/clone identifiers for single-cell data. Triggers single-cell processing logic in agg_receptors(). Default: NULL.
count_col
Character(1). Name of the column containing UMI counts or frequency counts for bulk sequencing data. Triggers bulk processing logic in agg_receptors(). Cannot be combined with barcode_col. Default: NULL.
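The relationship between barcode_col and count_col can be summarized as a small decision rule. The helper below, infer_mode(), is hypothetical and written only for illustration; it is not part of immundata. How read_repertoires() treats the case where both arguments are NULL is not stated here, so the sketch simply labels that case "unspecified".

```r
# Hypothetical helper: which processing mode do the arguments imply?
infer_mode <- function(barcode_col = NULL, count_col = NULL) {
  # The two arguments are documented as mutually exclusive
  if (!is.null(barcode_col) && !is.null(count_col)) {
    stop("Specify either barcode_col or count_col, not both.")
  }
  if (!is.null(barcode_col)) return("single-cell")
  if (!is.null(count_col)) return("bulk")
  # Assumption: behavior with neither argument is not documented here
  "unspecified"
}

# Column names below are placeholders
print(infer_mode(barcode_col = "barcode"))
print(infer_mode(count_col = "umis"))
```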
locus_col
Character(1). Name of the column specifying the receptor chain locus (e.g., "TRA", "TRB", "IGH", "IGK", "IGL"). Required if schema specifies chains for pairing. Default: NULL.
umi_col
Character(1). Name of the column containing UMI counts for single-cell data. Used during paired-chain processing to select the most abundant chain per barcode per locus. Default: NULL.
preprocess
List. A named list of functions to apply sequentially to the raw data before receptor aggregation. Each function should accept a data frame (or duckplyr_df) as its first argument. See make_default_preprocessing() for examples. Default: make_default_preprocessing(). Set to NULL or list() to disable.
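The shape of a preprocessing list can be sketched in base R as follows. The step names, the toy data, and the run_steps() helper are all made up for illustration; the real function applies its steps internally, and the defaults returned by make_default_preprocessing() may look different.

```r
# Toy raw data with a productive flag
raw <- data.frame(
  junction_aa = c("CASSL", "CASSL", "CASSD", "CASSF"),
  productive  = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# A named list of functions; each takes a data frame as its first argument
steps <- list(
  keep_productive = function(df) df[df$productive, , drop = FALSE],
  drop_duplicates = function(df) unique(df)
)

# Hypothetical helper: apply the steps one after another, in list order
run_steps <- function(df, steps) {
  for (step in steps) df <- step(df)
  df
}

cleaned <- run_steps(raw, steps)
print(cleaned$junction_aa)

# With NULL or list() the loop body never runs, so the data pass through unchanged
untouched <- run_steps(raw, NULL)
```

Because the steps run in order, a filter placed first reduces the amount of data every later step has to process.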
postprocess
List. A named list of functions to apply sequentially to the annotation data after receptor aggregation and metadata joining. Each function should accept a data frame (or duckplyr_df) as its first argument. See make_default_postprocessing() for examples. Default: make_default_postprocessing(). Set to NULL or list() to disable.
rename_columns
Named character vector. Optional mapping to rename columns in the input files using dplyr::rename() syntax (e.g., c(new_name = "old_name", barcode = "cell_id")). Renaming happens before preprocessing and schema application. See imd_rename_cols() for presets. Default: imd_rename_cols("10x").
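The direction of the mapping (new name on the left, old name on the right) is easy to get backwards. The sketch below reproduces the effect of such a mapping in base R on a toy data frame; it is an equivalent written for illustration, not the code the function runs, and the old column name "cdr3" is a placeholder.

```r
# Toy input with vendor-style column names
raw <- data.frame(
  cell_id = c("AAAC", "AAAG"),
  cdr3    = c("CASSL", "CASSD"),
  stringsAsFactors = FALSE
)

# Mapping in new = "old" form, as in dplyr::rename()
mapping <- c(barcode = "cell_id", junction_aa = "cdr3")

# Base R equivalent: find each old name and overwrite it with the new one
renamed <- raw
names(renamed)[match(mapping, names(renamed))] <- names(mapping)
print(names(renamed))
```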
enforce_schema
Logical(1). If TRUE, reading multiple files requires them to have exactly the same columns and types. If FALSE, columns are unioned across files (potentially slower, requires more memory). Default: TRUE.
metadata_file_col
Character(1). The name of the column in the metadata table that contains the full paths to the repertoire files. Only used when path = "<metadata>". Default: "File".
output_folder
Character(1). Path to a directory where intermediate processed annotation data will be saved as annotations.parquet and metadata.json. If NULL, a folder named immundata-<basename_without_ext> is created in the same directory as the first input file specified in path. The final ImmunData object reads from these saved files. Default: NULL.
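The default folder name described above can be approximated in base R. The helper default_output_folder() below is hypothetical and only mirrors the documented naming pattern; in particular, how the real function strips multi-part extensions such as .tsv.gz is not stated here, and this sketch removes only the last extension.

```r
# Hypothetical helper mirroring the documented default:
# "immundata-<basename_without_ext>" next to the first input file
default_output_folder <- function(path) {
  first_file <- path[1]
  stem <- tools::file_path_sans_ext(basename(first_file))
  file.path(dirname(first_file), paste0("immundata-", stem))
}

# Placeholder paths; only the first one determines the folder
out <- default_output_folder(c("/data/run1/my_sample.tsv", "/data/run1/other.tsv"))
print(out)
```

Passing output_folder explicitly avoids any dependence on this naming rule and keeps generated files out of the input directory.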
repertoire_schema
Character vector or Function. Defines columns used to group annotations into distinct repertoires (e.g., by sample or donor). If provided, agg_repertoires() is called after loading to add repertoire-level summaries and metrics. Default: NULL.
An ImmunData object containing the processed receptor annotations. If repertoire_schema was provided, the object will also contain repertoire definitions and summaries calculated by agg_repertoires().
The function executes the following steps:
1. Validates inputs.
2. Determines the list of input files based on path and metadata. Checks file extensions.
3. Reads data using duckplyr (read_parquet_duckdb or read_csv_duckdb). Handles .gz.
4. Applies column renaming if rename_columns is provided.
5. Applies preprocessing steps sequentially if preprocess is provided.
6. Aggregates sequences into receptors using agg_receptors(), based on schema, barcode_col, count_col, locus_col, and umi_col. This creates the core annotation table.
7. Joins the metadata table if provided.
8. Applies postprocessing steps sequentially if postprocess is provided.
9. Creates a temporary ImmunData object in memory.
10. Determines the output_folder path.
11. Saves the processed annotation table and metadata using write_immundata() to the output_folder.
12. Loads the data back from the saved Parquet files using read_immundata() to create the final ImmunData object. This ensures the returned object is backed by efficient storage.
13. If repertoire_schema is provided, calls agg_repertoires() on the loaded object to define and summarize repertoires.
14. Returns the final ImmunData object.
if (FALSE) { # \dontrun{
#
# Example 1: single-chain, one file
#
# Read a single AIRR TSV file, defining receptors by V/J/CDR3_aa
# Assume "my_sample.tsv" exists and follows AIRR format
# Create a dummy file for illustration
airr_data <- data.frame(
  sequence_id = paste0("seq", 1:5),
  v_call = c("TRBV1", "TRBV1", "TRBV2", "TRBV1", "TRBV3"),
  j_call = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1", "TRBJ1"),
  junction_aa = c("CASSL...", "CASSL...", "CASSD...", "CASSL...", "CASSF..."),
  productive = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  locus = c("TRB", "TRB", "TRB", "TRB", "TRB")
)
readr::write_tsv(airr_data, "my_sample.tsv")
# Define receptor schema
receptor_def <- c("v_call", "j_call", "junction_aa")
# Specify output folder
out_dir <- tempfile("immundata_output_")
# Read the data (disabling default preprocessing for this simple example)
idata <- read_repertoires(
  path = "my_sample.tsv",
  schema = receptor_def,
  output_folder = out_dir,
  preprocess = NULL, # Disable default productive filter for demo
  postprocess = NULL # Disable default barcode prefixing
)
print(idata)
print(idata$annotations)
#
# Example 2: single-chain, multiple files
#
# Read multiple files using metadata
# Create dummy files and metadata
readr::write_tsv(airr_data[1:2, ], "sample1.tsv")
readr::write_tsv(airr_data[3:5, ], "sample2.tsv")
meta <- data.frame(
  SampleID = c("S1", "S2"),
  Tissue = c("PBMC", "Tumor"),
  FilePath = c(normalizePath("sample1.tsv"), normalizePath("sample2.tsv"))
)
readr::write_tsv(meta, "metadata.tsv")
# Keep the output folder in a variable so it can be removed afterwards
out_dir_multi <- tempfile("immundata_multi_")
idata_multi <- read_repertoires(
  path = "<metadata>",
  metadata = meta,
  metadata_file_col = "FilePath",
  schema = receptor_def,
  repertoire_schema = "SampleID", # Aggregate by SampleID
  output_folder = out_dir_multi,
  preprocess = make_default_preprocessing("airr"), # Use default AIRR filters
  postprocess = NULL
)
print(idata_multi)
print(idata_multi$repertoires) # Check repertoire summary
# Clean up dummy files
file.remove("my_sample.tsv", "sample1.tsv", "sample2.tsv", "metadata.tsv")
unlink(out_dir, recursive = TRUE)
unlink(out_dir_multi, recursive = TRUE)
} # }