Processes a table of immune receptor sequences (chains or clonotypes) to
identify unique receptors based on a specified schema. It assigns a unique
identifier (imd_receptor_id) to each distinct receptor signature and
returns an annotated table linking the original sequence data to these
receptor IDs.
This function is a core component used within read_repertoires() and handles
different input data structures:
- Simple tables (no counts, no cell IDs).
- Bulk sequencing data (using a count column).
- Single-cell data (using a barcode/cell ID column). For single-cell data, it can perform chain pairing if the schema specifies multiple chains (e.g., TRA and TRB).
Usage
agg_receptors(
  dataset,
  schema,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL
)
Arguments
- dataset
  A data frame or duckplyr_df containing sequence/clonotype data. Must include the columns specified in schema and, where used, barcode_col, count_col, locus_col, and umi_col. Could be idata$annotations.
- schema
  Defines how a unique receptor is identified. Can be:
  - A character vector of column names representing receptor features (e.g., c("v_call", "j_call", "junction_aa")).
  - A list created by make_receptor_schema(), specifying both features (character vector) and optionally chains (character vector of locus names like "TRA", "TRB", "IGH", "IGK", "IGL", max length 2). Specifying chains triggers filtering by locus and enables pairing logic if two chains are given.
- barcode_col
  Character(1). The name of the column containing cell identifiers (barcodes). Required for single-cell processing and chain pairing. Default: NULL.
- count_col
  Character(1). The name of the column containing counts (e.g., UMI counts for bulk, clonotype frequency). Used for bulk data processing. Default: NULL. Cannot be specified if barcode_col is set.
- locus_col
  Character(1). The name of the column specifying the chain locus (e.g., "TRA", "TRB"). Required if schema includes chains for filtering or pairing. Default: NULL.
- umi_col
  Character(1). The name of the column containing UMI counts. Required for paired-chain single-cell data (length(schema$chains) == 2). Used to select the most abundant chain per locus within a cell when multiple chains of the same locus are present. Default: NULL.
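The constraints between these arguments can be summarized in a small validation sketch. This is a base R illustration of the rules stated above, not the package's own validation code: check_agg_args() is a hypothetical helper, a plain list stands in for the object returned by make_receptor_schema(), and the column names are made up.

```r
# Hypothetical helper, not part of immundata: it only restates the
# argument rules from the documentation above.
check_agg_args <- function(schema, barcode_col = NULL, count_col = NULL,
                           locus_col = NULL, umi_col = NULL) {
  # A bare character vector is treated as features only, with no chains.
  chains <- if (is.list(schema)) schema$chains else NULL
  if (length(chains) > 2) {
    stop("chains: max length is 2")
  }
  if (!is.null(barcode_col) && !is.null(count_col)) {
    stop("count_col cannot be specified if barcode_col is set")
  }
  if (length(chains) > 0 && is.null(locus_col)) {
    stop("locus_col is required when schema includes chains")
  }
  if (length(chains) == 2 && (is.null(barcode_col) || is.null(umi_col))) {
    stop("paired chains require barcode_col and umi_col")
  }
  invisible(TRUE)
}

# Assumption: a plain list with the same two fields as a receptor schema.
paired <- list(features = c("v_call", "junction_aa"), chains = c("TRA", "TRB"))
ok <- check_agg_args(paired, barcode_col = "cell_id",
                     locus_col = "locus", umi_col = "umis")

# Mixing a barcode column with a count column is rejected.
bad <- tryCatch(
  check_agg_args(c("v_call", "junction_aa"),
                 barcode_col = "cell_id", count_col = "n"),
  error = function(e) conditionMessage(e)
)
```

Here ok is TRUE, and bad holds the error message for the conflicting barcode_col/count_col combination.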
Value
A duckplyr_df (or data frame) representing the annotated sequences.
This table links each original sequence record (chain) to a defined receptor
and includes standardized columns:
- imd_receptor_id: Integer ID unique to each distinct receptor signature.
- imd_barcode_id: Integer ID unique to each cell/barcode (or row if no barcode).
- imd_chain_id: Integer ID unique to each input row (chain).
- imd_chain_count: Integer count associated with the chain (1 for single-cell/simple, from count_col for bulk).
This output is typically assigned to the $annotations field of an ImmunData object.
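To make the four standardized columns concrete, here is a minimal base R sketch of what they would look like for a toy bulk table. It illustrates the column semantics described above and is not the package's implementation: the input column names (v_call, junction_aa, n) and values are invented, and the actual numbering of IDs produced by agg_receptors() may differ.

```r
# Toy bulk table; column names and values are illustrative only.
bulk <- data.frame(
  v_call      = c("TRBV5-1", "TRBV5-1", "TRBV7-2"),
  junction_aa = c("CASSLGTEAFF", "CASSLGTEAFF", "CASSPGQGYEQYF"),
  n           = c(10L, 3L, 7L),
  stringsAsFactors = FALSE
)
features <- c("v_call", "junction_aa")  # plays the role of schema$features

# Build one key per row from the feature columns; rows with identical
# keys describe the same receptor signature.
key <- do.call(paste, c(bulk[features], sep = "|"))

annotated <- bulk
annotated$imd_receptor_id <- match(key, unique(key))  # one ID per signature
annotated$imd_barcode_id  <- seq_len(nrow(bulk))      # no barcodes: one per row
annotated$imd_chain_id    <- seq_len(nrow(bulk))      # one per input row
annotated$imd_chain_count <- bulk$n                   # taken from the count column
```

The first two rows share a signature and therefore the same imd_receptor_id, while each row keeps its own imd_chain_id and its own count.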
Details
The function performs the following main steps:
- Validation: Checks inputs, schema validity, and existence of required columns.
- Schema Parsing: Determines receptor features and target chains from schema.
- Locus Filtering: If schema$chains is provided, filters the dataset to include only rows matching the specified locus/loci.
- Processing Logic (based on barcode_col and count_col):
  - Simple Table/Bulk (No Barcodes): Assigns unique internal barcode/chain IDs. Identifies unique receptors based on schema$features. Calculates imd_chain_count (1 for simple table, from count_col for bulk).
  - Single-Cell (Barcodes Provided): Uses barcode_col for imd_barcode_id.
    - Single Chain (length(schema$chains) <= 1): Identifies unique receptors based on schema$features. imd_chain_count is 1.
    - Paired Chain (length(schema$chains) == 2): Requires locus_col and umi_col. Filters chains within each cell/locus group based on max umi_col. Creates paired receptors by joining the two specified loci for each cell based on schema$features from both. Assigns a unique imd_receptor_id to each pair. imd_chain_count is 1 (representing the chain record).
- Output: Returns an annotated data frame containing original columns plus internal identifiers (imd_receptor_id, imd_barcode_id, imd_chain_id) and counts (imd_chain_count).
Internal column names are typically managed by immundata:::imd_schema().
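The paired-chain path can be sketched in base R as follows. This is a simplified illustration of the steps described above, not the package's code: the toy column names (cell_id, locus, junction_aa, umis) are invented, ties on UMI count are broken by row order, and an inner join is assumed, so cells missing one of the two loci drop out. The real function may handle both cases differently.

```r
# Toy single-cell table; all column names and values are illustrative.
sc <- data.frame(
  cell_id     = c("c1", "c1", "c1", "c2", "c2", "c3"),
  locus       = c("TRA", "TRA", "TRB", "TRA", "TRB", "TRB"),
  junction_aa = c("CAVRDF", "CAASGF", "CASSLG", "CAVRDF", "CASSLG", "CASSPG"),
  umis        = c(5L, 2L, 9L, 4L, 6L, 3L),
  stringsAsFactors = FALSE
)
chains <- c("TRA", "TRB")  # plays the role of schema$chains

# Locus filtering: keep only rows from the requested loci.
sc <- sc[sc$locus %in% chains, ]

# Within each cell/locus group keep the chain with the highest UMI count.
sc  <- sc[order(sc$cell_id, sc$locus, -sc$umis), ]
top <- sc[!duplicated(sc[c("cell_id", "locus")]), ]

# Join the two loci per cell (inner join: both chains must be present).
tra <- top[top$locus == "TRA", c("cell_id", "junction_aa")]
trb <- top[top$locus == "TRB", c("cell_id", "junction_aa")]
pairs <- merge(tra, trb, by = "cell_id", suffixes = c("_tra", "_trb"))

# One receptor ID per distinct pair of feature values.
key <- paste(pairs$junction_aa_tra, pairs$junction_aa_trb, sep = "|")
pairs$imd_receptor_id <- match(key, unique(key))
```

In this toy input, cell c1 carries two TRA chains and only the one with more UMIs survives. Cells c1 and c2 end up with the same TRA/TRB pair and share a receptor ID, while c3 has no TRA chain and is absent from pairs under the inner-join assumption.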