Processes a table of immune receptor sequences (chains or clonotypes) to
identify unique receptors based on a specified schema. It assigns a unique
identifier (imd_receptor_id) to each distinct receptor signature and
returns an annotated table linking the original sequence data to these
receptor IDs.
This function is a core component used within read_repertoires() and handles
different input data structures:
Simple tables (no counts, no cell IDs).
Bulk sequencing data (using a count column).
Single-cell data (using a barcode/cell ID column). For single-cell data, it can perform chain pairing if the schema specifies two chains (e.g., TRA and TRB).
agg_receptors(
  dataset,
  schema,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL
)
dataset: A data frame or duckplyr_df containing sequence/clonotype data.
Must include the columns specified in schema and, where they are used,
barcode_col, count_col, locus_col, and umi_col. Could be idata$annotations.
schema: Defines how a unique receptor is identified. Can be:
A character vector of column names representing receptor features
(e.g., c("v_call", "j_call", "junction_aa")).
A list created by make_receptor_schema(), specifying both features
(character vector) and optionally chains (character vector of locus
names like "TRA", "TRB", "IGH", "IGK", "IGL", max length 2).
Specifying chains triggers filtering by locus and enables pairing logic
if two chains are given.
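The two schema forms can be illustrated with a minimal base R sketch. The list below is built by hand only to show the shape described above (a features element plus an optional chains element); it is an assumption that this mirrors what make_receptor_schema() returns, and normalize_schema() is a hypothetical helper, not part of the package.

```r
# Form 1: a plain character vector of feature columns.
schema_simple <- c("v_call", "j_call", "junction_aa")

# Form 2: a list with features and chains. Built by hand here as an
# assumed shape; in practice make_receptor_schema() would create it.
schema_paired <- list(
  features = c("v_call", "junction_aa"),
  chains   = c("TRA", "TRB")
)

# Hypothetical helper: bring either form to a list and check the
# constraints stated in the text (non-empty features, at most 2 chains).
normalize_schema <- function(schema) {
  if (is.character(schema)) {
    schema <- list(features = schema, chains = NULL)
  }
  stopifnot(is.character(schema$features), length(schema$features) > 0)
  stopifnot(
    is.null(schema$chains) ||
      (is.character(schema$chains) && length(schema$chains) <= 2)
  )
  schema
}

s1 <- normalize_schema(schema_simple)  # no chains: no locus filtering
s2 <- normalize_schema(schema_paired)  # two chains: pairing is possible
```

With a single form internally, later steps only need to check length(schema$chains) to decide between no filtering, single-locus filtering, and pairing.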
barcode_col: Character(1). The name of the column containing cell
identifiers (barcodes). Required for single-cell processing and chain pairing.
Default: NULL.
count_col: Character(1). The name of the column containing counts
(e.g., UMI counts for bulk, clonotype frequency). Used for bulk data
processing. Default: NULL. Cannot be specified if barcode_col is set.
locus_col: Character(1). The name of the column specifying the chain locus
(e.g., "TRA", "TRB"). Required if schema includes chains, for filtering
or pairing. Default: NULL.
umi_col: Character(1). The name of the column containing UMI counts.
Required for paired-chain single-cell data (length(schema$chains) == 2).
Used to select the most abundant chain per locus within a cell when multiple
chains of the same locus are present. Default: NULL.
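The constraints between these arguments can be summarized in a short base R sketch. check_args() is a hypothetical helper written for illustration; it takes the chains vector directly instead of a full schema, and it is not the package's own validation code. The mode labels it returns are also made up for this example.

```r
# Hypothetical helper: enforce the argument rules stated above and
# report which processing mode they imply.
check_args <- function(chains = NULL, barcode_col = NULL, count_col = NULL,
                       locus_col = NULL, umi_col = NULL) {
  # count_col cannot be specified if barcode_col is set.
  if (!is.null(barcode_col) && !is.null(count_col)) {
    stop("count_col cannot be combined with barcode_col")
  }
  # locus_col is required whenever the schema lists chains.
  if (length(chains) > 0 && is.null(locus_col)) {
    stop("locus_col is required when the schema lists chains")
  }
  # Pairing needs cell barcodes and UMI counts.
  if (length(chains) == 2 && (is.null(barcode_col) || is.null(umi_col))) {
    stop("paired chains require barcode_col and umi_col")
  }
  if (!is.null(barcode_col)) {
    if (length(chains) == 2) "single-cell, paired" else "single-cell, single chain"
  } else if (!is.null(count_col)) {
    "bulk"
  } else {
    "simple"
  }
}

mode_simple <- check_args()
mode_bulk   <- check_args(count_col = "n")
mode_paired <- check_args(
  chains = c("TRA", "TRB"), barcode_col = "barcode",
  locus_col = "locus", umi_col = "umis"
)
```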
A duckplyr_df (or data frame) representing the annotated sequences.
This table links each original sequence record (chain) to a defined receptor
and includes standardized columns:
imd_receptor_id: Integer ID unique to each distinct receptor signature.
imd_barcode_id: Integer ID unique to each cell/barcode (or row if no barcode).
imd_chain_id: Integer ID unique to each input row (chain).
imd_chain_count: Integer count associated with the chain (1 for SC/simple,
from count_col for bulk).
This output is typically assigned to the $annotations field of an ImmunData
object.
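To make the four standardized columns concrete, here is a minimal base R sketch for a small bulk table. The input columns (v_call, junction_aa, n) and their values are invented for illustration, and the way IDs are numbered (by first appearance) is an assumption; the package may number receptors differently.

```r
# Toy bulk table; "n" plays the role of count_col.
chains <- data.frame(
  v_call      = c("TRBV1", "TRBV1", "TRBV2"),
  junction_aa = c("CASSF", "CASSF", "CASSL"),
  n           = c(5L, 2L, 7L),
  stringsAsFactors = FALSE
)
features <- c("v_call", "junction_aa")

# A receptor signature is the combination of all feature values in a row.
signature <- do.call(paste, c(unname(as.list(chains[features])), sep = "|"))

# Rows with the same signature share one receptor ID.
chains$imd_receptor_id <- match(signature, unique(signature))
# Without a barcode column, every row is its own "cell" and its own chain.
chains$imd_barcode_id  <- seq_len(nrow(chains))
chains$imd_chain_id    <- seq_len(nrow(chains))
# Bulk data: the chain count comes from the count column.
chains$imd_chain_count <- chains$n
```

The first two rows share a signature and therefore a receptor ID, while each row keeps its own chain ID and count.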
The function performs the following main steps:
Validation: Checks inputs, schema validity, and existence of required columns.
Schema Parsing: Determines receptor features and target chains from schema.
Locus Filtering: If schema$chains is provided, filters the dataset
to include only rows matching the specified locus/loci.
Processing Logic (based on barcode_col and count_col):
  Simple Table/Bulk (No Barcodes): Assigns unique internal barcode/chain IDs.
  Identifies unique receptors based on schema$features. Calculates
  imd_chain_count (1 for simple table, from count_col for bulk).
  Single-Cell (Barcodes Provided): Uses barcode_col for imd_barcode_id.
    Single Chain (length(schema$chains) <= 1): Identifies unique
    receptors based on schema$features. imd_chain_count is 1.
    Paired Chain (length(schema$chains) == 2): Requires locus_col
    and umi_col. Within each cell/locus group, keeps the chain with the
    highest umi_col value. Creates paired receptors by joining the two
    specified loci for each cell based on schema$features from both.
    Assigns a unique imd_receptor_id to each pair.
    imd_chain_count is 1 (representing the chain record).
Output: Returns an annotated data frame containing original columns plus
internal identifiers (imd_receptor_id, imd_barcode_id, imd_chain_id)
and counts (imd_chain_count).
Internal column names are typically managed by immundata:::imd_schema().
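The paired-chain branch is the least obvious of the steps above, so here is a minimal base R sketch of it. It is not the package's implementation: the column names (barcode, locus, junction_aa, umis) and data are invented, ties in UMI counts are broken by row order, and cells lacking one of the two loci are dropped by the inner join, all of which are assumptions.

```r
# Toy single-cell table: cell c1 has two TRA chains, cell c3 has no TRA.
cells <- data.frame(
  barcode     = c("c1", "c1", "c1", "c2", "c2", "c3"),
  locus       = c("TRA", "TRA", "TRB", "TRA", "TRB", "TRB"),
  junction_aa = c("CAVA", "CAVB", "CASSF", "CAVA", "CASSF", "CASSL"),
  umis        = c(3L, 9L, 4L, 2L, 6L, 5L),
  stringsAsFactors = FALSE
)

# Step 1: within each (barcode, locus) group keep the chain with the
# highest UMI count. Sort by descending UMIs, then keep the first row.
ord    <- order(cells$barcode, cells$locus, -cells$umis)
sorted <- cells[ord, ]
top    <- sorted[!duplicated(sorted[c("barcode", "locus")]), ]

# Step 2: join the two loci per cell on the barcode (inner join).
tra   <- top[top$locus == "TRA", c("barcode", "junction_aa")]
trb   <- top[top$locus == "TRB", c("barcode", "junction_aa")]
pairs <- merge(tra, trb, by = "barcode", suffixes = c("_tra", "_trb"))

# Step 3: one receptor ID per distinct pair of feature values.
pair_sig <- paste(pairs$junction_aa_tra, pairs$junction_aa_trb, sep = "|")
pairs$imd_receptor_id <- match(pair_sig, unique(pair_sig))
```

In this toy data, cell c1 contributes its more abundant TRA chain (CAVB, 9 UMIs) to the pair, and cell c3 yields no paired receptor because it has no TRA chain.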