Filter ImmunData by receptor features, barcodes or any annotations
Source: R/operations_filter.R
filter_immundata.Rd
Provides flexible filtering options for an ImmunData object.
filter() is the main function, allowing filtering based on receptor features
(e.g., CDR3 sequence) using various matching methods (exact, regex, fuzzy) and/or
standard dplyr-style filtering on annotation columns.
filter_barcodes() is a convenience function to filter by specific cell barcodes.
filter_receptors() is a convenience function to filter by specific receptor identifiers.
Usage
filter_immundata(idata, ..., seq_options = NULL, keep_repertoires = TRUE)
# S3 method for class 'ImmunData'
filter(
  .data,
  ...,
  .by = NULL,
  .preserve = FALSE,
  seq_options = NULL,
  keep_repertoires = TRUE
)
filter_barcodes(idata, barcodes, keep_repertoires = TRUE)
filter_receptors(idata, receptors, keep_repertoires = TRUE)
Arguments
- idata, .data
  An ImmunData object.
- ...
  For filter, these are regular dplyr-style filtering expressions (e.g., V_gene == "IGHV1-1", chain == "IGH") applied to the $annotations table before sequence filtering. Ignored by filter_barcodes and filter_receptors.
- seq_options
  For filter, an optional named list specifying sequence-based filtering options. Use make_seq_options() for convenient creation. The list can contain:
  - query_col (Character scalar): The name of the column in $annotations containing sequences to compare (e.g., "CDR3_aa", "FR1_nt").
  - patterns (Character vector): A vector of sequences or regular expressions to match against query_col.
  - method (Character scalar): The matching method. One of "exact", "regex", "lev" (Levenshtein distance), or "hamm" (Hamming distance). The default is typically handled by make_seq_options.
  - max_dist (Numeric scalar): For fuzzy methods ("lev", "hamm"), the maximum allowed distance. Rows with distance <= max_dist are kept. The default is typically handled by make_seq_options.
  - name_type (Character scalar): Determines column names in intermediate distance calculations, if applicable ("index" or "pattern"). Passed through to internal annotation functions. The default is typically handled by make_seq_options.
  If seq_options is NULL (the default), no sequence-based filtering is performed.
- keep_repertoires
  Logical scalar. If TRUE (the default) and the input idata has repertoire information (idata$schema_repertoire is not NULL), the repertoire summaries are recalculated from the filtered data using agg_repertoires(). If FALSE, or if no repertoire schema exists, the returned ImmunData object does not contain repertoire summaries ($repertoires is NULL).
- .by
  Not used.
- .preserve
  Not used.
- barcodes
  For filter_barcodes, a vector of cell identifiers (barcodes) to keep. Can be character, integer, or numeric.
- receptors
  For filter_receptors, a vector of receptor identifiers to keep. Can be character, integer, or numeric.
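As a rough illustration of the seq_options argument described above, the named list can be pictured as below. This is a sketch only: the field names come from the argument description, the values are made up, and whether filter() accepts a hand-built list without going through make_seq_options() is an assumption.

```r
# Hand-built list mirroring the fields documented for seq_options.
# The values are illustrative; defaults normally come from make_seq_options().
seq_opts <- list(
  query_col = "CDR3_aa",                                  # column to compare
  patterns  = c("CARGLGLVFYGMDVW", "CARDNRGAVAGVFGEAFYW"), # sequences to match
  method    = "lev",                                      # Levenshtein distance
  max_dist  = 1,                                          # keep rows with distance <= 1
  name_type = "pattern"                                   # naming of intermediate columns
)

# Inspect the structure of the list
str(seq_opts)
```

In practice make_seq_options() is the documented way to build this list, since it fills in defaults.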
Value
A new ImmunData object containing only the filtered annotations
(and potentially recalculated repertoire summaries). The schema remains the same.
Details
For filter:
- User-provided dplyr-style filters (...) are applied before any sequence-based filtering defined in seq_options.
- Sequence filtering compares values in the query_col of the annotations table against the provided patterns.
- Supported sequence matching methods are:
  - "exact": Keeps rows where query_col exactly matches any of the patterns.
  - "regex": Keeps rows where query_col matches any of the regular expressions in patterns.
  - "lev" (Levenshtein distance): Keeps rows where the edit distance between query_col and any pattern is less than or equal to max_dist.
  - "hamm" (Hamming distance): Keeps rows where the Hamming distance (for equal length strings) between query_col and any pattern is less than or equal to max_dist.
- The filtering operations act on the $annotations table. A new ImmunData object is created containing only the rows (and corresponding receptors) that pass the filter(s).
- If keep_repertoires = TRUE (and repertoire data exists in the input), the repertoire-level summaries ($repertoires table) are recalculated based on the filtered annotations. Otherwise, the $repertoires table in the output will be NULL.
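The semantics of the four matching methods, and the order in which annotation filters and sequence filters are applied, can be sketched in base R. This is not the package's implementation: match_sequences is a hypothetical helper, the toy table is made up, and treating strings of unequal length as a Hamming non-match is an assumption.

```r
# Hypothetical helper approximating the documented matching methods.
# Returns TRUE for each element of x that matches at least one pattern.
match_sequences <- function(x, patterns,
                            method = c("exact", "regex", "lev", "hamm"),
                            max_dist = 0) {
  method <- match.arg(method)
  switch(method,
    exact = x %in% patterns,
    regex = Reduce(`|`,
                   lapply(patterns, function(p) grepl(p, x)),
                   rep(FALSE, length(x))),
    lev = {
      # utils::adist() returns a matrix: rows are x, columns are patterns
      d <- utils::adist(x, patterns)
      unname(apply(d, 1, min) <= max_dist)
    },
    hamm = {
      # Assumption: strings of different length never match
      hamming <- function(a, b) {
        if (nchar(a) != nchar(b)) return(Inf)
        as.numeric(sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]]))
      }
      vapply(x, function(a) {
        min(vapply(patterns, function(b) hamming(a, b), numeric(1))) <= max_dist
      }, logical(1), USE.NAMES = FALSE)
    }
  )
}

# Toy annotation table (made-up data)
ann <- data.frame(
  chain   = c("IGH", "IGK", "IGH", "IGH"),
  CDR3_aa = c("CARGLGLVFYGMDVW", "CARGLGLVFYGMDVF",
              "CARGLGLVFYGMDV", "CARDNRGAVAGVFGEAFYW"),
  stringsAsFactors = FALSE
)
pattern <- "CARGLGLVFYGMDVW"

# Step 1: annotation filter first, as with the dplyr-style expressions in `...`
step1 <- ann[ann$chain == "IGH", , drop = FALSE]

# Step 2: sequence filter on the remaining rows
step2 <- step1[match_sequences(step1$CDR3_aa, pattern, "lev", max_dist = 1), ,
               drop = FALSE]
print(step2)
```

Note that "lev" keeps the sequence with a trailing deletion, while "hamm" under the assumption above would drop it because the lengths differ.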
For filter_barcodes and filter_receptors:
These functions provide a simpler interface for common filtering tasks based on cell barcodes or receptor IDs, respectively. They use efficient semi_join operations internally.
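What a semi-join does here can be shown with a base R equivalent: rows of the annotation table are kept when their key appears in the supplied vector, and no columns are added or rows duplicated. The table below is made up, and the cell_id and receptor_id column names are taken from the examples on this page rather than from a confirmed schema.

```r
# Toy annotation table (made-up data; column names are assumptions)
annotations <- data.frame(
  cell_id     = c("cell1_barcode", "cell2_barcode", "cell5_barcode", "cell5_barcode"),
  receptor_id = c(101, 150, 205, 206),
  stringsAsFactors = FALSE
)

# Barcodes to keep; one of them is absent from the table
barcodes <- c("cell1_barcode", "cell5_barcode", "not_present")

# Base R stand-in for a semi-join on cell_id:
# keep rows whose key is in `barcodes`, leaving columns unchanged
kept_cells <- annotations[annotations$cell_id %in% barcodes, , drop = FALSE]
print(kept_cells)

# The same idea applied to receptor identifiers
kept_receptors <- annotations[annotations$receptor_id %in% c(101, 205), , drop = FALSE]
print(kept_receptors)
```

Keys that are absent from the table, such as "not_present" above, are simply ignored.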
Examples
# Basic setup (assuming idata_test is a valid ImmunData object)
# print(idata_test)
# --- filter examples ---
if (FALSE) { # \dontrun{
# Example 1: dplyr-style filtering on annotations
filtered_heavy <- filter(idata_test, chain == "IGH")
print(filtered_heavy)
# Example 2: Exact sequence matching on CDR3 amino acid sequence
cdr3_patterns <- c("CARGLGLVFYGMDVW", "CARDNRGAVAGVFGEAFYW")
seq_opts_exact <- make_seq_options(query_col = "CDR3_aa", patterns = cdr3_patterns)
filtered_exact_cdr3 <- filter(idata_test, seq_options = seq_opts_exact)
print(filtered_exact_cdr3)
# Example 3: Combining dplyr-style and fuzzy sequence matching (Levenshtein)
seq_opts_lev <- make_seq_options(
  query_col = "CDR3_aa",
  patterns = "CARGLGLVFYGMDVW",
  method = "lev",
  max_dist = 1
)
filtered_combined <- filter(idata_test,
  chain == "IGH",
  C_gene == "IGHG1",
  seq_options = seq_opts_lev
)
print(filtered_combined)
# Example 4: Regex matching on V gene
v_gene_pattern <- "^IGHV[13]-" # Keep only IGHV1 or IGHV3 families
seq_opts_regex <- make_seq_options(
  query_col = "V_gene",
  patterns = v_gene_pattern,
  method = "regex"
)
filtered_regex_v <- filter(idata_test, seq_options = seq_opts_regex)
print(filtered_regex_v)
# Example 5: Filtering without recalculating repertoires
filtered_no_rep <- filter(idata_test, chain == "IGK", keep_repertoires = FALSE)
print(filtered_no_rep) # $repertoires should be NULL
} # }
# --- filter_barcodes example ---
if (FALSE) { # \dontrun{
# Assuming 'cell1_barcode' and 'cell5_barcode' exist in idata_test$annotations$cell_id
specific_barcodes <- c("cell1_barcode", "cell5_barcode")
filtered_cells <- filter_barcodes(idata_test, barcodes = specific_barcodes)
print(filtered_cells)
} # }
# --- filter_receptors example ---
if (FALSE) { # \dontrun{
# Assuming receptor IDs 101 and 205 exist in idata_test$annotations$receptor_id
specific_receptors <- c(101, 205) # Or character IDs if applicable
filtered_recs <- filter_receptors(idata_test, receptors = specific_receptors)
print(filtered_recs)
} # }