Processes a table of immune receptor sequences (chains or clonotypes) to identify unique receptors based on a specified schema. It assigns a unique identifier (imd_receptor_id) to each distinct receptor signature and returns an annotated table linking the original sequence data to these receptor IDs.

This function is a core component used within read_repertoires() and handles different input data structures:

  • Simple tables (no counts, no cell IDs).

  • Bulk sequencing data (using a count column).

  • Single-cell data (using a barcode/cell ID column). For single-cell data, it can perform chain pairing if the schema specifies multiple chains (e.g., TRA and TRB).

agg_receptors(
  dataset,
  schema,
  barcode_col = NULL,
  count_col = NULL,
  locus_col = NULL,
  umi_col = NULL
)

Arguments

dataset

A data frame or duckplyr_df containing sequence/clonotype data. Must include the columns named in schema and, where supplied, the columns named by barcode_col, count_col, locus_col, and umi_col. This can be the annotations table of an existing ImmunData object (idata$annotations).

schema

Defines how a unique receptor is identified. Can be:

  • A character vector of column names representing receptor features (e.g., c("v_call", "j_call", "junction_aa")).

  • A list created by make_receptor_schema(), specifying both features (character vector) and optionally chains (character vector of locus names like "TRA", "TRB", "IGH", "IGK", "IGL", max length 2). Specifying chains triggers filtering by locus and enables pairing logic if two chains are given.

barcode_col

Character(1). The name of the column containing cell identifiers (barcodes). Required for single-cell processing and chain pairing. Default: NULL.

count_col

Character(1). The name of the column containing counts (e.g., UMI counts for bulk, clonotype frequency). Used for bulk data processing. Default: NULL. Cannot be specified if barcode_col is set.

locus_col

Character(1). The name of the column specifying the chain locus (e.g., "TRA", "TRB"). Required if schema includes chains for filtering or pairing. Default: NULL.

umi_col

Character(1). The name of the column containing UMI counts. Required for paired-chain single-cell data (length(schema$chains) == 2). Used to select the most abundant chain per locus within a cell when multiple chains of the same locus are present. Default: NULL.
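The arguments above combine differently for bulk and paired single-cell input. The following is a minimal sketch of both call shapes, not an official example: the toy tables and the column names count, barcode, locus, and umis are hypothetical, and the calls only run when immundata is installed.

```r
# Toy bulk table: one row per clonotype with a count (hypothetical column names).
bulk <- data.frame(
  v_call      = c("TRBV5-1", "TRBV5-1", "TRBV7-9"),
  j_call      = c("TRBJ2-7", "TRBJ2-7", "TRBJ1-1"),
  junction_aa = c("CASSLGQYEQYF", "CASSLGQYEQYF", "CASSPGTEAFF"),
  count       = c(10L, 3L, 7L)
)

# Toy single-cell table: one TRA and one TRB chain per cell.
sc <- data.frame(
  barcode     = c("cell1", "cell1", "cell2", "cell2"),
  locus       = c("TRA", "TRB", "TRA", "TRB"),
  v_call      = c("TRAV12-1", "TRBV5-1", "TRAV8-4", "TRBV7-9"),
  junction_aa = c("CAVNTGNQFYF", "CASSLGQYEQYF", "CAVSDRGSTLGRLYF", "CASSPGTEAFF"),
  umis        = c(5L, 9L, 4L, 6L)
)

if (requireNamespace("immundata", quietly = TRUE)) {
  # Bulk: schema given as a plain character vector, counts taken from count_col.
  bulk_ann <- immundata::agg_receptors(
    bulk,
    schema    = c("v_call", "j_call", "junction_aa"),
    count_col = "count"
  )

  # Paired single-cell: two chains in the schema, so barcode_col,
  # locus_col, and umi_col are all supplied.
  paired_schema <- immundata::make_receptor_schema(
    features = c("v_call", "junction_aa"),
    chains   = c("TRA", "TRB")
  )
  sc_ann <- immundata::agg_receptors(
    sc,
    schema      = paired_schema,
    barcode_col = "barcode",
    locus_col   = "locus",
    umi_col     = "umis"
  )
}
```

Note that count_col and barcode_col are never passed together, in line with the restriction stated for count_col.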

Value

A duckplyr_df (or data frame) representing the annotated sequences. This table links each original sequence record (chain) to a defined receptor and includes standardized columns:

  • imd_receptor_id: Integer ID unique to each distinct receptor signature.

  • imd_barcode_id: Integer ID unique to each cell/barcode (or row if no barcode).

  • imd_chain_id: Integer ID unique to each input row (chain).

  • imd_chain_count: Integer count associated with the chain (1 for SC/simple, from count_col for bulk).

This output is typically assigned to the $annotations field of an ImmunData object.
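To make the meaning of these columns concrete, here is a base R sketch that derives them for a small bulk table. It illustrates the idea only and is not the package's implementation: the toy data, the count column name, and the way IDs are numbered are assumptions, and the real function may assign different integer values.

```r
# Toy bulk table; rows 1, 2, and 4 share the same receptor signature.
chains <- data.frame(
  v_call      = c("TRBV5-1", "TRBV5-1", "TRBV7-9", "TRBV5-1"),
  j_call      = c("TRBJ2-7", "TRBJ2-7", "TRBJ1-1", "TRBJ2-7"),
  junction_aa = c("CASSLGQYEQYF", "CASSLGQYEQYF", "CASSPGTEAFF", "CASSLGQYEQYF"),
  count       = c(10L, 3L, 7L, 1L)
)
features <- c("v_call", "j_call", "junction_aa")

# Build one signature string per row from the schema features.
key <- do.call(paste, c(chains[features], sep = "|"))

annotated <- chains
# Rows with the same signature get the same receptor ID.
annotated$imd_receptor_id <- match(key, unique(key))
# Without a barcode column, every row gets its own barcode and chain ID.
annotated$imd_barcode_id  <- seq_len(nrow(chains))
annotated$imd_chain_id    <- seq_len(nrow(chains))
# For bulk data the chain count comes from the count column.
annotated$imd_chain_count <- chains$count
```

In this sketch the three rows carrying CASSLGQYEQYF map to one receptor while keeping separate chain IDs and counts, which is the one-to-many link between receptors and chain records that the returned table encodes.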

Details

The function performs the following main steps:

  1. Validation: Checks inputs, schema validity, and existence of required columns.

  2. Schema Parsing: Determines receptor features and target chains from schema.

  3. Locus Filtering: If schema$chains is provided, filters the dataset to include only rows matching the specified locus/loci.

  4. Processing Logic (based on barcode_col and count_col):

    • Simple Table/Bulk (No Barcodes): Assigns unique internal barcode/chain IDs. Identifies unique receptors based on schema$features. Calculates imd_chain_count (1 for simple table, from count_col for bulk).

    • Single-Cell (Barcodes Provided): Uses barcode_col for imd_barcode_id.

      • Single Chain: (length(schema$chains) <= 1). Identifies unique receptors based on schema$features. imd_chain_count is 1.

      • Paired Chain: (length(schema$chains) == 2). Requires locus_col and umi_col. Within each cell and locus, keeps the chain with the highest umi_col value. Creates paired receptors by joining the two specified loci for each cell, using schema$features from both chains. Assigns a unique imd_receptor_id to each distinct pair. imd_chain_count is 1 (representing the chain record).

  5. Output: Returns an annotated data frame containing original columns plus internal identifiers (imd_receptor_id, imd_barcode_id, imd_chain_id) and counts (imd_chain_count).
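The paired-chain branch of step 4 can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's actual code: the toy data and column names are hypothetical, tie-breaking between equal UMI counts is not specified in this page and is arbitrary here, and mapping the pair IDs back onto the individual chain rows is omitted.

```r
# Toy single-cell table; cell c1 has two TRA chains with different UMI counts.
sc <- data.frame(
  barcode     = c("c1", "c1", "c1", "c2", "c2", "c3", "c3"),
  locus       = c("TRA", "TRA", "TRB", "TRA", "TRB", "TRA", "TRB"),
  junction_aa = c("CAVA", "CAVB", "CASSX", "CAVA", "CASSX", "CAVC", "CASSY"),
  umis        = c(8L, 2L, 5L, 4L, 6L, 3L, 3L)
)
target_chains <- c("TRA", "TRB")

# Locus filtering: keep only rows from the loci named in the schema.
sc <- sc[sc$locus %in% target_chains, ]

# Per cell and locus, keep the chain with the highest UMI count:
# sort so the top chain comes first, then drop later duplicates.
sc  <- sc[order(sc$barcode, sc$locus, -sc$umis), ]
top <- sc[!duplicated(sc[c("barcode", "locus")]), ]

# Join the two loci per cell to form one paired record per barcode.
tra   <- top[top$locus == "TRA", c("barcode", "junction_aa")]
trb   <- top[top$locus == "TRB", c("barcode", "junction_aa")]
pairs <- merge(tra, trb, by = "barcode", suffixes = c("_TRA", "_TRB"))

# Identical TRA/TRB feature combinations share one receptor ID.
key <- paste(pairs$junction_aa_TRA, pairs$junction_aa_TRB, sep = "|")
pairs$imd_receptor_id <- match(key, unique(key))
```

Here cells c1 and c2 end up with the same receptor because, once the low-UMI TRA chain in c1 is dropped, both carry the same TRA and TRB features. Cells missing one of the two loci fall out of the inner join in this sketch; how the package treats such cells is not stated in this page.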

Internal column names are typically managed by immundata:::imd_schema().