Filter ImmunData by receptor features, barcodes or any annotations
Source: R/operations_filter.R
filter_immundata.Rd
Provides flexible filtering options for an ImmunData object.
filter() is the main function, allowing filtering based on receptor features (e.g., CDR3 sequence) using various matching methods (exact, regex, fuzzy) and/or standard dplyr-style filtering on annotation columns.
filter_barcodes() is a convenience function to filter by specific cell barcodes.
filter_receptors() is a convenience function to filter by specific receptor identifiers.
Usage
filter_immundata(idata, ..., seq_options = NULL, keep_repertoires = TRUE)
# S3 method for class 'ImmunData'
filter(
  .data,
  ...,
  .by = NULL,
  .preserve = FALSE,
  seq_options = NULL,
  keep_repertoires = TRUE
)
filter_barcodes(idata, barcodes, keep_repertoires = TRUE)
filter_receptors(idata, receptors, keep_repertoires = TRUE)
Arguments
- idata, .data
An ImmunData object.
- ...
For filter, these are regular dplyr-style filtering expressions (e.g., V_gene == "IGHV1-1", chain == "IGH") applied to the $annotations table before sequence filtering. Ignored by filter_barcodes and filter_receptors.
- seq_options
For filter, an optional named list specifying sequence-based filtering options. Use make_seq_options() for convenient creation. The list can contain:
  - query_col (Character scalar): The name of the column in $annotations containing sequences to compare (e.g., "CDR3_aa", "FR1_nt").
  - patterns (Character vector): A vector of sequences or regular expressions to match against query_col.
  - method (Character scalar): The matching method. One of "exact", "regex", "lev" (Levenshtein distance), or "hamm" (Hamming distance). The default is typically supplied by make_seq_options().
  - max_dist (Numeric scalar): For fuzzy methods ("lev", "hamm"), the maximum allowed distance. Rows with distance <= max_dist are kept. The default is typically supplied by make_seq_options().
  - name_type (Character scalar): Determines column names in intermediate distance calculations, if applicable ("index" or "pattern"). Passed through to internal annotation functions. The default is typically supplied by make_seq_options().
If seq_options is NULL (the default), no sequence-based filtering is performed.
- keep_repertoires
Logical scalar. If TRUE (the default) and the input idata has repertoire information (idata$schema_repertoire is not NULL), the repertoire summaries will be recalculated based on the filtered data using agg_repertoires(). If FALSE, or if no repertoire schema exists, the returned ImmunData object will not contain repertoire summaries ($repertoires will be NULL).
- .by
Not used.
- .preserve
Not used.
- barcodes
For filter_barcodes, a vector of cell identifiers (barcodes) to keep. Can be character, integer, or numeric.
- receptors
For filter_receptors, a vector of receptor identifiers to keep. Can be character, integer, or numeric.
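The seq_options argument is described above as a plain named list, so its shape can be sketched by hand. The field names below come from the argument description; the values are illustrative only. Whether filter() accepts a hand-built list without any additional fields is an assumption, and make_seq_options() remains the documented way to create one.

```r
# Hand-built sketch of the documented seq_options structure.
# Field names follow the argument description; values are illustrative.
seq_opts <- list(
  query_col = "CDR3_aa",            # column in $annotations to compare
  patterns  = c("CARGLGLVFYGMDVW"), # sequences or regular expressions
  method    = "lev",                # "exact", "regex", "lev" or "hamm"
  max_dist  = 1,                    # only meaningful for "lev" and "hamm"
  name_type = "pattern"             # "index" or "pattern"
)

# Light sanity checks mirroring the documented types (not package code).
stopifnot(
  is.character(seq_opts$query_col), length(seq_opts$query_col) == 1,
  is.character(seq_opts$patterns),
  seq_opts$method %in% c("exact", "regex", "lev", "hamm"),
  is.numeric(seq_opts$max_dist), seq_opts$max_dist >= 0,
  seq_opts$name_type %in% c("index", "pattern")
)

str(seq_opts)
```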
Value
A new ImmunData object containing only the filtered annotations (and potentially recalculated repertoire summaries). The schema remains the same.
Details
For filter:
- User-provided dplyr-style filters (...) are applied before any sequence-based filtering defined in seq_options.
- Sequence filtering compares values in the query_col of the annotations table against the provided patterns.
- Supported sequence matching methods are:
  - "exact": Keeps rows where query_col exactly matches any of the patterns.
  - "regex": Keeps rows where query_col matches any of the regular expressions in patterns.
  - "lev" (Levenshtein distance): Keeps rows where the edit distance between query_col and any pattern is less than or equal to max_dist.
  - "hamm" (Hamming distance): Keeps rows where the Hamming distance (for equal length strings) between query_col and any pattern is less than or equal to max_dist.
- The filtering operations act on the $annotations table. A new ImmunData object is created containing only the rows (and corresponding receptors) that pass the filter(s).
- If keep_repertoires = TRUE (and repertoire data exists in the input), the repertoire-level summaries ($repertoires table) are recalculated based on the filtered annotations. Otherwise, the $repertoires table in the output will be NULL.
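To make the order of operations and the four matching methods concrete, here is a minimal base R sketch on a toy data frame. It does not use the package and is not its implementation: the table `ann`, the helpers `any_match()` and `hamming()`, and the choice to treat unequal-length strings as non-matches under "hamm" are all assumptions made for illustration.

```r
# Toy annotations table; column names mirror the examples on this page.
ann <- data.frame(
  chain   = c("IGH", "IGH", "IGK", "IGH"),
  CDR3_aa = c("CARGLGLVFYGMDVW", "CARGLGLVFYGMDVF", "CQQYNSYPLTF", "CARDNRGAVW"),
  stringsAsFactors = FALSE
)
patterns <- "CARGLGLVFYGMDVW"
max_dist <- 1

# Step 1: annotation filter (what `...` expresses) is applied first.
ann <- ann[ann$chain == "IGH", , drop = FALSE]
x <- ann$CDR3_aa

# Step 2: sequence filter on the remaining rows, one variant per method.

# "exact": the value equals at least one pattern.
keep_exact <- x %in% patterns

# "regex": the value matches at least one regular expression.
# any_match() is a hypothetical helper, not a package function.
any_match <- function(x, pats) {
  hits <- vapply(pats, function(p) grepl(p, x), logical(length(x)))
  rowSums(matrix(hits, nrow = length(x))) > 0
}
keep_regex <- any_match(x, "^CARG")

# "lev": Levenshtein distance to at least one pattern is <= max_dist.
# utils::adist() returns a length(x) by length(patterns) distance matrix.
keep_lev <- apply(utils::adist(x, patterns) <= max_dist, 1, any)

# "hamm": Hamming distance, defined only for strings of equal length.
# Assumption: unequal lengths count as a non-match.
hamming <- function(a, b) {
  if (nchar(a) != nchar(b)) return(NA_integer_)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}
d_hamm <- outer(x, patterns, Vectorize(hamming))
keep_hamm <- apply(!is.na(d_hamm) & d_hamm <= max_dist, 1, any)

# Side-by-side view of which rows each method would keep.
print(data.frame(CDR3_aa = x, keep_exact, keep_regex, keep_lev, keep_hamm))
```

In this toy data the second sequence differs from the pattern by a single substitution, so it passes the fuzzy methods at max_dist = 1 but not the exact match.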
For filter_barcodes and filter_receptors:
These functions provide a simpler interface for common filtering tasks based on cell barcodes or receptor IDs, respectively. They use efficient semi_join operations internally.
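Semi-join semantics can be illustrated without the package. The sketch below uses base R on a hypothetical `ann` table (the cell_id and receptor_id column names are taken from the examples on this page); with dplyr, the equivalent call would be dplyr::semi_join(ann, data.frame(cell_id = barcodes), by = "cell_id"). It shows the behaviour of a semi-join, not the package's internal code.

```r
# Toy annotations table; values are made up for illustration.
ann <- data.frame(
  cell_id     = c("cell1_barcode", "cell2_barcode", "cell5_barcode", "cell5_barcode"),
  receptor_id = c(101, 150, 205, 206),
  stringsAsFactors = FALSE
)

# Barcodes to keep; "cell9_barcode" does not occur in the table.
barcodes <- c("cell5_barcode", "cell1_barcode", "cell9_barcode")

# Semi-join semantics: keep the rows of `ann` whose key appears in
# `barcodes`. No columns are added, rows are never duplicated, the row
# order of `ann` is preserved, and unknown keys are ignored.
kept <- ann[ann$cell_id %in% unique(barcodes), , drop = FALSE]
print(kept)
```

A cell with several receptors (here cell5_barcode) keeps all of its rows, which is the main difference from a lookup that returns one row per requested key.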
Examples
# Basic setup (assuming idata_test is a valid ImmunData object)
# print(idata_test)
# --- filter examples ---
if (FALSE) { # \dontrun{
  # Example 1: dplyr-style filtering on annotations
  filtered_heavy <- filter(idata_test, chain == "IGH")
  print(filtered_heavy)

  # Example 2: Exact sequence matching on CDR3 amino acid sequence
  cdr3_patterns <- c("CARGLGLVFYGMDVW", "CARDNRGAVAGVFGEAFYW")
  seq_opts_exact <- make_seq_options(query_col = "CDR3_aa", patterns = cdr3_patterns)
  filtered_exact_cdr3 <- filter(idata_test, seq_options = seq_opts_exact)
  print(filtered_exact_cdr3)

  # Example 3: Combining dplyr-style and fuzzy sequence matching (Levenshtein)
  seq_opts_lev <- make_seq_options(
    query_col = "CDR3_aa",
    patterns = "CARGLGLVFYGMDVW",
    method = "lev",
    max_dist = 1
  )
  filtered_combined <- filter(
    idata_test,
    chain == "IGH",
    C_gene == "IGHG1",
    seq_options = seq_opts_lev
  )
  print(filtered_combined)

  # Example 4: Regex matching on V gene
  v_gene_pattern <- "^IGHV[13]-" # Keep only IGHV1 or IGHV3 families
  seq_opts_regex <- make_seq_options(
    query_col = "V_gene",
    patterns = v_gene_pattern,
    method = "regex"
  )
  filtered_regex_v <- filter(idata_test, seq_options = seq_opts_regex)
  print(filtered_regex_v)

  # Example 5: Filtering without recalculating repertoires
  filtered_no_rep <- filter(idata_test, chain == "IGK", keep_repertoires = FALSE)
  print(filtered_no_rep) # $repertoires should be NULL
} # }
# --- filter_barcodes example ---
if (FALSE) { # \dontrun{
  # Assuming 'cell1_barcode' and 'cell5_barcode' exist in idata_test$annotations$cell_id
  specific_barcodes <- c("cell1_barcode", "cell5_barcode")
  filtered_cells <- filter_barcodes(idata_test, barcodes = specific_barcodes)
  print(filtered_cells)
} # }
# --- filter_receptors example ---
if (FALSE) { # \dontrun{
  # Assuming receptor IDs 101 and 205 exist in idata_test$annotations$receptor_id
  specific_receptors <- c(101, 205) # Or character IDs if applicable
  filtered_recs <- filter_receptors(idata_test, receptors = specific_receptors)
  print(filtered_recs)
} # }