Processes a table of immune receptor sequences (chains or clonotypes) to
identify unique receptors based on a specified schema. It assigns a unique
identifier (imd_receptor_id) to each distinct receptor signature and
returns an annotated table linking the original sequence data to these
receptor IDs.
This function is a core component used within read_repertoires() and handles
different input data structures:
Simple tables (no counts, no cell IDs).
Bulk sequencing data (using a count column).
Single-cell data (using a barcode/cell ID column). For single-cell data, it can perform chain pairing if the schema specifies multiple chains (e.g., TRA and TRB).
Usage
agg_receptors(
dataset,
schema,
barcode_col = NULL,
count_col = NULL,
locus_col = NULL,
umi_col = NULL
)
Arguments
- dataset
A data frame or duckplyr_df containing sequence/clonotype data. Must include columns specified in schema and potentially barcode_col, count_col, locus_col, umi_col. Could be idata$annotations.
- schema
Defines how a unique receptor is identified. Can be:
A character vector of column names representing receptor features (e.g., c("v_call", "j_call", "junction_aa")).
A list created by make_receptor_schema(), specifying both features (character vector) and optionally chains (character vector of locus names like "TRA", "TRB", "IGH", "IGK", "IGL", max length 2). Specifying chains triggers filtering by locus and enables pairing logic if two chains are given.
- barcode_col
Character(1). The name of the column containing cell identifiers (barcodes). Required for single-cell processing and chain pairing. Default: NULL.
- count_col
Character(1). The name of the column containing counts (e.g., UMI counts for bulk, clonotype frequency). Used for bulk data processing. Default: NULL. Cannot be specified if barcode_col is set.
- locus_col
Character(1). The name of the column specifying the chain locus (e.g., "TRA", "TRB"). Required if schema includes chains for filtering or pairing. Default: NULL.
- umi_col
Character(1). The name of the column containing UMI counts. Required for paired-chain single-cell data (length(schema$chains) == 2). Used to select the most abundant chain per locus within a cell when multiple chains of the same locus are present. Default: NULL.
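The argument combinations described above can be sketched as two calls on a toy table. This is an illustration under assumptions, not a tested recipe: the column names barcode, locus and umis and all values are made up, and the argument names features and chains for make_receptor_schema() are inferred from the description of schema above. The calls run only if the immundata package is installed.

```r
# Toy single-cell table; column names and values are made up for illustration.
chains <- data.frame(
  barcode     = c("c1", "c1", "c2", "c2"),
  locus       = c("TRA", "TRB", "TRA", "TRB"),
  v_call      = c("TRAV1", "TRBV2", "TRAV1", "TRBV2"),
  j_call      = c("TRAJ3", "TRBJ1", "TRAJ3", "TRBJ1"),
  junction_aa = c("CAVR", "CASSL", "CAVR", "CASSL"),
  umis        = c(5L, 9L, 3L, 7L),
  stringsAsFactors = FALSE
)
features <- c("v_call", "j_call", "junction_aa")

if (requireNamespace("immundata", quietly = TRUE)) {
  # Simple table: schema is a plain character vector, no barcode or count column.
  simple <- immundata::agg_receptors(chains, schema = features)

  # Paired chains: a schema with two loci needs barcode_col, locus_col and umi_col.
  # Assumption: make_receptor_schema() accepts `features` and `chains` by name.
  paired_schema <- immundata::make_receptor_schema(
    features = features,
    chains   = c("TRA", "TRB")
  )
  paired <- immundata::agg_receptors(
    chains,
    schema      = paired_schema,
    barcode_col = "barcode",
    locus_col   = "locus",
    umi_col     = "umis"
  )
}
```

Note that count_col is left at NULL in the paired call, since it cannot be combined with barcode_col.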
Value
A duckplyr_df (or data frame) representing the annotated sequences.
This table links each original sequence record (chain) to a defined receptor
and includes standardized columns:
imd_receptor_id: Integer ID unique to each distinct receptor signature.
imd_barcode_id: Integer ID unique to each cell/barcode (or row if no barcode).
imd_chain_id: Integer ID unique to each input row (chain).
imd_chain_count: Integer count associated with the chain (1 for SC/simple, from count_col for bulk).
This output is typically assigned to the $annotations field of an ImmunData object.
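To make the four standardized columns concrete, here is a minimal base R sketch that builds them for a toy bulk table. It is not the package's implementation: the input column n_reads and all values are hypothetical, and the order in which receptor IDs are numbered here (by first appearance) is an arbitrary choice.

```r
# Toy bulk table: one row per chain, with a count column (values are made up).
bulk <- data.frame(
  v_call      = c("TRBV2", "TRBV2", "TRBV5", "TRBV2"),
  j_call      = c("TRBJ1", "TRBJ1", "TRBJ2", "TRBJ1"),
  junction_aa = c("CASSL", "CASSL", "CASSQ", "CASSF"),
  n_reads     = c(10L, 4L, 7L, 1L),
  stringsAsFactors = FALSE
)
features <- c("v_call", "j_call", "junction_aa")

# Build one signature string per row from the schema features.
signature <- do.call(paste, c(bulk[features], sep = "|"))

annotated <- bulk
# Rows with the same signature share a receptor ID.
annotated$imd_receptor_id <- match(signature, unique(signature))
# Without a barcode column, each row gets its own barcode ID and chain ID.
annotated$imd_barcode_id <- seq_len(nrow(bulk))
annotated$imd_chain_id <- seq_len(nrow(bulk))
# Bulk data: the chain count is taken from the count column.
annotated$imd_chain_count <- bulk$n_reads
```

Rows 1 and 2 share a signature and therefore a receptor ID, while every row keeps its own chain ID and its original count.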
Details
The function performs the following main steps:
Validation: Checks inputs, schema validity, and existence of required columns.
Schema Parsing: Determines receptor features and target chains from schema.
Locus Filtering: If schema$chains is provided, filters the dataset to include only rows matching the specified locus/loci.
Processing Logic (based on barcode_col and count_col):
- Simple Table/Bulk (No Barcodes): Assigns unique internal barcode/chain IDs. Identifies unique receptors based on schema$features. Calculates imd_chain_count (1 for simple table, from count_col for bulk).
- Single-Cell (Barcodes Provided): Uses barcode_col for imd_barcode_id.
  - Single Chain (length(schema$chains) <= 1): Identifies unique receptors based on schema$features. imd_chain_count is 1.
  - Paired Chain (length(schema$chains) == 2): Requires locus_col and umi_col. Filters chains within each cell/locus group based on max umi_col. Creates paired receptors by joining the two specified loci for each cell based on schema$features from both. Assigns a unique imd_receptor_id to each pair. imd_chain_count is 1 (representing the chain record).
Output: Returns an annotated data frame containing original columns plus internal identifiers (imd_receptor_id, imd_barcode_id, imd_chain_id) and counts (imd_chain_count).
Internal column names are typically managed by immundata:::imd_schema().
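The paired-chain branch can be illustrated with a short base R sketch: keep the chain with the most UMIs per cell and locus, join the two loci per cell, then number the distinct pairs. This is a sketch under assumptions, not the package's code. The toy columns and values are made up, a single feature (junction_aa) stands in for the full schema, ties in UMI count are broken by row order, and the inner join drops cells lacking either locus; how the real function handles ties and incomplete cells is not stated here.

```r
# Toy single-cell table (values are made up). Cell c1 has two TRA chains,
# cell c3 has no TRA chain.
cells <- data.frame(
  barcode     = c("c1", "c1", "c1", "c2", "c2", "c3", "c4", "c4"),
  locus       = c("TRA", "TRA", "TRB", "TRA", "TRB", "TRB", "TRA", "TRB"),
  junction_aa = c("CAVR", "CAVS", "CASSL", "CAVS", "CASSL", "CASSQ", "CAVR", "CASSQ"),
  umis        = c(2L, 8L, 5L, 3L, 6L, 4L, 1L, 2L),
  stringsAsFactors = FALSE
)

# 1. Within each cell/locus group, keep the chain with the highest UMI count:
#    sort by descending UMIs inside each group, then keep the first row per group.
sorted <- cells[order(cells$barcode, cells$locus, -cells$umis), ]
top <- sorted[!duplicated(sorted[c("barcode", "locus")]), ]

# 2. Join the two loci per cell (inner join, an assumption of this sketch).
tra <- top[top$locus == "TRA", c("barcode", "junction_aa")]
trb <- top[top$locus == "TRB", c("barcode", "junction_aa")]
pairs <- merge(tra, trb, by = "barcode", suffixes = c("_tra", "_trb"))

# 3. Assign one receptor ID per distinct pair of chain signatures.
pair_sig <- paste(pairs$junction_aa_tra, pairs$junction_aa_trb, sep = "+")
pairs$imd_receptor_id <- match(pair_sig, unique(pair_sig))
```

In this toy input, cells c1 and c2 end up with the same TRA/TRB pair and share a receptor ID, c4 gets a second ID, and c3 drops out of the join. The sketch stops at the pair table and does not link receptor IDs back to the individual chain rows as the function's output does.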