Filter ImmunData by receptor features, barcodes or any annotations

Provides flexible filtering options for an ImmunData object.

filter() is the main function, allowing filtering based on receptor features (e.g., CDR3 sequence) using various matching methods (exact, regex, fuzzy) and/or standard dplyr-style filtering on annotation columns.

filter_barcodes() is a convenience function to filter by specific cell barcodes.

filter_receptors() is a convenience function to filter by specific receptor identifiers.

Usage

filter_immundata(idata, ..., seq_options = NULL, keep_repertoires = TRUE)

# S3 method for class 'ImmunData'
filter(
  .data,
  ...,
  .by = NULL,
  .preserve = FALSE,
  seq_options = NULL,
  keep_repertoires = TRUE
)

filter_barcodes(idata, barcodes, keep_repertoires = TRUE)

filter_receptors(idata, receptors, keep_repertoires = TRUE)

Arguments

idata, .data

An ImmunData object.

...

For filter, these are regular dplyr-style filtering expressions (e.g., V_gene == "IGHV1-1", chain == "IGH") applied to the $annotations table before sequence filtering. Ignored by filter_barcodes and filter_receptors.

seq_options

For filter, an optional named list specifying sequence-based filtering options. Use make_seq_options() for convenient creation. The list can contain:

query_col (Character scalar): The name of the column in $annotations containing sequences to compare (e.g., "CDR3_aa", "FR1_nt").
patterns (Character vector): A vector of sequences or regular expressions to match against query_col.
method (Character scalar): The matching method. One of "exact", "regex", "lev" (Levenshtein distance), or "hamm" (Hamming distance). Defaults typically handled by make_seq_options.
max_dist (Numeric scalar): For fuzzy methods ("lev", "hamm"), the maximum allowed distance. Rows with distance <= max_dist are kept. Defaults typically handled by make_seq_options.
name_type (Character scalar): Determines column names in intermediate distance calculations if applicable ("index" or "pattern"). Passed through to internal annotation functions. Defaults typically handled by make_seq_options. If seq_options is NULL (the default), no sequence-based filtering is performed.

keep_repertoires

Logical scalar. If TRUE (the default) and the input idata has repertoire information (idata$schema_repertoire is not NULL), the repertoire summaries will be recalculated based on the filtered data using agg_repertoires(). If FALSE, or if no repertoire schema exists, the returned ImmunData object will not contain repertoire summaries ($repertoires will be NULL).

.by

Not used.

.preserve

Not used.

barcodes

For filter_barcodes, a vector of cell identifiers (barcodes) to keep. Can be character, integer, or numeric.

receptors

For filter_receptors, a vector of receptor identifiers to keep. Can be character, integer, or numeric.

Value

A new ImmunData object containing only the filtered annotations (and potentially recalculated repertoire summaries). The schema remains the same.

Details

For filter:

User-provided dplyr-style filters (...) are applied before any sequence-based filtering defined in seq_options.
Sequence filtering compares values in the query_col of the annotations table against the provided patterns.
Supported sequence matching methods are:
- "exact": Keeps rows where query_col exactly matches any of the patterns.
- "regex": Keeps rows where query_col matches any of the regular expressions in patterns.
- "lev" (Levenshtein distance): Keeps rows where the edit distance between query_col and any pattern is less than or equal to max_dist.
- "hamm" (Hamming distance): Keeps rows where the Hamming distance (for equal length strings) between query_col and any pattern is less than or equal to max_dist.
The filtering operations act on the $annotations table. A new ImmunData object is created containing only the rows (and corresponding receptors) that pass the filter(s).
If keep_repertoires = TRUE (and repertoire data exists in the input), the repertoire-level summaries ($repertoires table) are recalculated based on the filtered annotations. Otherwise, the $repertoires table in the output will be NULL.

For filter_barcodes and filter_receptors:

These functions provide a simpler interface for common filtering tasks based on cell barcodes or receptor IDs, respectively. They use efficient semi_join operations internally.

Examples

# Basic setup (assuming idata_test is a valid ImmunData object)
# print(idata_test)

# --- filter examples ---
if (FALSE) { # \dontrun{
# Example 1: dplyr-style filtering on annotations
filtered_heavy <- filter(idata_test, chain == "IGH")
print(filtered_heavy)

# Example 2: Exact sequence matching on CDR3 amino acid sequence
cdr3_patterns <- c("CARGLGLVFYGMDVW", "CARDNRGAVAGVFGEAFYW")
seq_opts_exact <- make_seq_options(query_col = "CDR3_aa", patterns = cdr3_patterns)
filtered_exact_cdr3 <- filter(idata_test, seq_options = seq_opts_exact)
print(filtered_exact_cdr3)

# Example 3: Combining dplyr-style and fuzzy sequence matching (Levenshtein)
seq_opts_lev <- make_seq_options(
  query_col = "CDR3_aa",
  patterns = "CARGLGLVFYGMDVW",
  method = "lev",
  max_dist = 1
)
filtered_combined <- filter(idata_test,
  chain == "IGH",
  C_gene == "IGHG1",
  seq_options = seq_opts_lev
)
print(filtered_combined)

# Example 4: Regex matching on V gene
v_gene_pattern <- "^IGHV[13]-" # Keep only IGHV1 or IGHV3 families
seq_opts_regex <- make_seq_options(
  query_col = "V_gene",
  patterns = v_gene_pattern,
  method = "regex"
)
filtered_regex_v <- filter(idata_test, seq_options = seq_opts_regex)
print(filtered_regex_v)

# Example 5: Filtering without recalculating repertoires
filtered_no_rep <- filter(idata_test, chain == "IGK", keep_repertoires = FALSE)
print(filtered_no_rep) # $repertoires should be NULL
} # }

# --- filter_barcodes example ---
if (FALSE) { # \dontrun{
# Assuming 'cell1_barcode' and 'cell5_barcode' exist in idata_test$annotations$cell_id
specific_barcodes <- c("cell1_barcode", "cell5_barcode")
filtered_cells <- filter_barcodes(idata_test, barcodes = specific_barcodes)
print(filtered_cells)
} # }

# --- filter_receptors example ---
if (FALSE) { # \dontrun{
# Assuming receptor IDs 101 and 205 exist in idata_test$annotations$receptor_id
specific_receptors <- c(101, 205) # Or character IDs if applicable
filtered_recs <- filter_receptors(idata_test, receptors = specific_receptors)
print(filtered_recs)
} # }