Annotate ImmunData object — annotate

Joins additional annotation data to the annotations slot of an ImmunData object.

This function adds extra information to your repertoire data by joining a data frame of annotations on one or more specified columns.

Usage

annotate_immundata(
  idata,
  annotations,
  by,
  keep_repertoires = TRUE,
  remove_limit = FALSE
)

annotate(idata, annotations, by, keep_repertoires = TRUE, remove_limit = FALSE)

annotate_receptors(
  idata,
  annotations,
  annot_col = imd_schema("receptor"),
  keep_repertoires = TRUE,
  remove_limit = FALSE
)

annotate_barcodes(
  idata,
  annotations,
  annot_col = "<rownames>",
  keep_repertoires = TRUE,
  remove_limit = FALSE
)

annotate_chains(
  idata,
  annotations,
  annot_col = imd_schema("chain"),
  keep_repertoires = TRUE,
  remove_limit = FALSE
)

Arguments

idata: An ImmunData R6 object containing repertoire and annotation data.
annotations: A data frame containing the annotations to be joined.
by: A named character vector specifying the columns to join by. The names of the vector should be the column names in idata$annotations and the values should be the corresponding column names in the annotations data frame.
keep_repertoires: Logical. If TRUE (default) and the ImmunData object contains repertoire data (idata$schema_repertoire is not NULL), the repertoires will be re-aggregated after joining the annotations. Set to FALSE if you do not want to re-aggregate repertoires immediately.
remove_limit: Logical. If FALSE (default), a warning will be issued if the annotations data frame has 100 or more columns, suggesting potential performance issues. Set to TRUE to disable this warning and allow joining of annotations with an arbitrary number of columns. Use with caution, as joining wide data frames can be memory-intensive and slow.
annot_col: A character vector specifying the column with receptor, barcode, or chain identifiers used to annotate the corresponding receptors, barcodes, or chains in idata.

Value

A new ImmunData object with the annotations joined to the annotations slot.

Details

The function performs a left join operation, keeping all rows from idata$annotations and adding matching columns from the annotations data frame. If there are multiple matches in annotations for a row in idata$annotations, all combinations will be returned, potentially increasing the number of rows in the resulting annotations table.
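The row multiplication described above can be reproduced outside the package. The following is a minimal sketch in base R, not the package's internal code: `existing` and `extra` are hypothetical stand-ins for idata$annotations and the annotations data frame, and `merge()` with `all.x = TRUE` plays the role of the left join.

```r
# Hypothetical stand-in for idata$annotations: one row per barcode.
existing <- data.frame(
  barcode = c("AAAC", "AAAG", "AAAT"),
  sample = c("s1", "s1", "s2")
)

# Hypothetical annotations with a duplicated key ("AAAC" appears twice).
extra <- data.frame(
  cell_barcode = c("AAAC", "AAAC", "AAAG"),
  cell_type = c("CD4", "CD8", "B cell")
)

# Find keys that occur more than once before joining.
dup_keys <- unique(extra$cell_barcode[duplicated(extra$cell_barcode)])

# Base R left join: all.x = TRUE keeps every row of 'existing'.
joined <- merge(
  existing, extra,
  by.x = "barcode", by.y = "cell_barcode",
  all.x = TRUE
)

# "AAAC" now occupies two rows; "AAAT" has no match and gets NA for cell_type.
print(dup_keys)
print(joined)
```

Checking for duplicated keys first, as `dup_keys` does here, is one way to decide whether the annotations should be deduplicated or summarised before they are joined.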

The function uses checkmate to validate the input types and structure.

A check is performed to ensure that the columns specified in by exist in both idata$annotations and the annotations data frame.
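To see which columns would fail such a check before calling the function, a similar test can be run by hand. This is a sketch under stated assumptions: `check_join_columns()` is a hypothetical helper written for illustration, not a function exported by the package, and it only mirrors the documented rule that the names of by refer to the left table and the values refer to the annotations data frame.

```r
# Hypothetical helper: report 'by' columns missing from either table.
# left_cols  - column names of the existing annotations table
# right_cols - column names of the annotations data frame to join
# by         - named character vector, names = left columns, values = right columns
check_join_columns <- function(left_cols, right_cols, by) {
  stopifnot(is.character(by), !is.null(names(by)), all(names(by) != ""))
  list(
    missing_left = setdiff(names(by), left_cols),
    missing_right = setdiff(unname(by), right_cols)
  )
}

res <- check_join_columns(
  left_cols = c("sample", "barcode"),
  right_cols = c("sample_id", "treatment"),
  by = c("sample" = "sample_id", "barcode" = "cell_barcode")
)

# 'cell_barcode' is absent from the right-hand table, so it is reported.
print(res)
```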

The annotations data frame is converted to a duckdb tibble internally for efficient joining, especially with large datasets.

Warning

By default (remove_limit = FALSE), joining an annotations data frame with 100 or more columns will trigger a warning. This is a safeguard to prevent accidental joining of very wide data (e.g., gene expression data) that could lead to performance degradation or crashes. If you understand the risks and intend to join a wide data frame, set remove_limit = TRUE.
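An alternative to disabling the safeguard is to narrow the data frame before joining. The sketch below uses base R only; `gene_expression` and `genes_of_interest` are made-up illustration data, and the 100-column threshold is taken from the description above rather than read from the package.

```r
# Hypothetical wide table: one 'barcode' column plus 150 gene columns.
gene_expression <- cbind(
  data.frame(barcode = c("AAAC", "AAAG", "AAAT")),
  as.data.frame(matrix(
    0, nrow = 3, ncol = 150,
    dimnames = list(NULL, paste0("gene", 1:150))
  ))
)

# Would this table trigger the documented warning (100 or more columns)?
is_wide <- ncol(gene_expression) >= 100

# Keep only the join key and the columns actually needed downstream.
genes_of_interest <- c("gene1", "gene7", "gene42")
narrow <- gene_expression[, c("barcode", genes_of_interest), drop = FALSE]

print(is_wide)
print(dim(narrow))
```

Joining `narrow` instead of the full table stays under the limit, so remove_limit can be left at its default.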

Examples

if (FALSE) { # \dontrun{
# Assuming 'my_immun_data' is an ImmunData object and 'sample_info' is a data frame
# with a column 'sample_id' matching 'sample' in my_immun_data$annotations
# and additional columns like 'treatment' and 'disease_status'.

sample_info <- data.frame(
  sample_id = c("sample1", "sample2", "sample3", "sample4"),
  treatment = c("Treatment A", "Treatment B", "Treatment A", "Treatment C"),
  disease_status = c("Healthy", "Disease", "Healthy", "Disease"),
  stringsAsFactors = FALSE # Important to keep characters as characters
)

# Join sample information using the 'sample' column
my_immun_data_annotated <- annotate(
  idata = my_immun_data,
  annotations = sample_info,
  by = c("sample" = "sample_id")
)

# Join data by multiple columns, e.g., 'sample' and 'barcode'
# Assuming 'cell_annotations' is a data frame with 'sample', 'sample_barcode', and 'cell_type' columns
my_immun_data_cell_annotated <- annotate(
  idata = my_immun_data,
  annotations = cell_annotations,
  by = c("sample" = "sample", "barcode" = "sample_barcode")
)

# Join a wide data frame, suppressing the column limit warning
# Assuming 'gene_expression' is a data frame with 'barcode' and many gene columns
my_immun_data_gene_expression <- annotate(
  idata = my_immun_data,
  annotations = gene_expression,
  by = c("barcode" = "barcode"),
  remove_limit = TRUE
)
} # }