This function creates a filter_mut column that will be read by the calculate_mf function and other downstream functions. Variants with filter_mut == TRUE will be excluded from group mutation
counts. This function may also remove records based on user specification.
Running this function again on the same data will not override the previous
filters. To reset previous filters, set the filter_mut column values to
FALSE.
Usage
filter_mut(
  mutation_data,
  vaf_cutoff = 1,
  snv_in_germ_mnv = FALSE,
  rm_abnormal_vaf = FALSE,
  custom_filter_col = NULL,
  custom_filter_val = NULL,
  custom_filter_rm = FALSE,
  regions = NULL,
  regions_filter,
  allow_half_overlap = FALSE,
  rg_sep = "\t",
  is_0_based_rg = TRUE,
  rm_filtered_mut_from_depth = FALSE,
  return_filtered_rows = FALSE
)
Arguments
- mutation_data
Your mutation data. This can be a data frame or a GRanges object.
- vaf_cutoff
Filter out ostensibly germline variants using a cutoff for variant allele fraction (VAF). Any variant with a vaf larger than the cutoff will be filtered. The default is 1 (no filtering). It is recommended to use a value of 0.01 (i.e. 1%) as a conservative approach to retain only somatic variants.
- snv_in_germ_mnv
Filter out SNVs that overlap with germline MNVs within the same sample. MNVs are considered germline if their vaf > vaf_cutoff. Default is FALSE.
- rm_abnormal_vaf
A logical value. If TRUE, rows in mutation_data with a variant allele fraction (VAF) between 0.05 and 0.45 or between 0.55 and 0.95 will be removed. We expect variants to have a VAF of ~0, 0.5, or 1, reflecting rare somatic mutations, heterozygous germline mutations, and homozygous germline mutations, respectively. Default is FALSE.
- custom_filter_col
The name of the column in mutation_data to apply a custom filter to. This column will be checked for specific values, as defined by custom_filter_val. If any row in this column contains one of the specified values, that row will either be flagged in the filter_mut column or, if specified by custom_filter_rm, removed from mutation_data.
- custom_filter_val
A set of values used to filter rows in mutation_data based on custom_filter_col. If a row in custom_filter_col matches any value in custom_filter_val, it will either be set to TRUE in the filter_mut column or removed, depending on custom_filter_rm.
- custom_filter_rm
A logical value. If TRUE, rows in custom_filter_col that match any value in custom_filter_val will be removed from mutation_data. If FALSE, filter_mut will be set to TRUE for those rows. Default is FALSE.
- regions
Remove rows that are within/outside of specified regions. regions can be either a file path, a data frame, or a GRanges object containing the genomic ranges by which to filter. File paths will be read using rg_sep. Users can also choose from the built-in TwinStrand Mutagenesis Panels by inputting "TSpanel_human", "TSpanel_mouse", or "TSpanel_rat". Required columns for the regions file are "contig", "start", and "end". In a GRanges object, the required columns are "seqnames", "start", and "end".
- regions_filter
Specifies how the provided regions should be applied to mutation_data. Acceptable values are "remove_within" or "keep_within". If set to "remove_within", records that fall within the specified regions will be removed from mutation_data. If set to "keep_within", only records within the specified regions will be kept in mutation_data, and all other records will be removed.
- allow_half_overlap
A logical value. If TRUE, records that start or end in your regions, but extend outside of them in either direction, will be included in the filter. If FALSE, only records that start and end within the regions will be included in the filter. Default is FALSE.
- rg_sep
The delimiter used when importing a regions file. The default is tab-delimited ("\t").
- is_0_based_rg
A logical value. Indicates whether the position coordinates in regions are 0-based (TRUE) or 1-based (FALSE). If TRUE, positions will be converted to 1-based (start + 1). Need not be supplied for TSpanels. Default is TRUE.
- rm_filtered_mut_from_depth
A logical value. If TRUE, the function will subtract the alt_depth of records that were flagged by the filter_mut column from their total_depth. This will treat flagged variants as No-calls. This will not apply to variants flagged as germline by the vaf_cutoff. However, if the germline variant has additional filters applied, then the subtraction will still occur. If FALSE, the alt_depth will be retained in the total_depth for all variants. Default is FALSE.
- return_filtered_rows
A logical value. If TRUE, the function will return both the filtered mutation data and the records that were removed/flagged in a separate data frame. The two data frames will be returned inside a list, with names mutation_data and filtered_rows. Default is FALSE.
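The interaction between is_0_based_rg, allow_half_overlap, and regions_filter can be easier to follow on a toy example. The sketch below is not the package's implementation: it uses base R only, the data frames and the helper in_region() are hypothetical, and treating region boundaries as inclusive is an assumption made for illustration.

```r
# Hypothetical toy inputs for illustration; this is not MutSeqR code.
regions <- data.frame(
  contig = "chr1", start = 99, end = 200,   # start is 0-based here
  stringsAsFactors = FALSE
)
muts <- data.frame(
  contig = c("chr1", "chr1", "chr1", "chr2"),
  start  = c(100, 195, 300, 150),
  end    = c(100, 205, 300, 150),
  stringsAsFactors = FALSE
)

# is_0_based_rg = TRUE: convert region starts to 1-based (start + 1)
regions$start <- regions$start + 1

# Hypothetical helper: does one record fall in any region?
# Boundaries are assumed inclusive.
in_region <- function(m_contig, m_start, m_end, rg, allow_half_overlap = FALSE) {
  same_contig <- rg$contig == m_contig
  start_in <- m_start >= rg$start & m_start <= rg$end
  end_in   <- m_end   >= rg$start & m_end   <= rg$end
  hit <- if (allow_half_overlap) start_in | end_in else start_in & end_in
  any(same_contig & hit)
}

within_strict <- mapply(
  in_region, muts$contig, muts$start, muts$end,
  MoreArgs = list(rg = regions, allow_half_overlap = FALSE),
  USE.NAMES = FALSE
)
within_half <- mapply(
  in_region, muts$contig, muts$start, muts$end,
  MoreArgs = list(rg = regions, allow_half_overlap = TRUE),
  USE.NAMES = FALSE
)

# regions_filter = "keep_within" keeps matching rows;
# "remove_within" keeps the complement.
kept_strict    <- muts[within_strict, , drop = FALSE]
removed_within <- muts[!within_strict, , drop = FALSE]
```

In this toy data the record spanning 195 to 205 starts inside the region but ends outside it, so it only counts as a match when allow_half_overlap is TRUE.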
Value
A data frame or a list of two data frames, depending on the
value of return_filtered_rows. If return_filtered_rows is
FALSE (default), a data frame of the same structure as mutation_data
is returned, with an additional column, filter_mut, indicating
whether each record has been flagged for filtering (TRUE) or not (FALSE).
If return_filtered_rows is TRUE, a list containing two data frames
is returned. The first data frame, named mutation_data, is the
filtered mutation data as described above. The second data frame,
named filtered_rows, contains all records that were either
removed from mutation_data or flagged with filter_mut == TRUE.
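To make the shape of the return value concrete, here is a minimal base R sketch that mimics it on hypothetical toy data. It does not call filter_mut(); only the vaf_cutoff rule described above is reproduced, and the column names other than filter_mut are assumptions for illustration.

```r
# Hypothetical toy data; column names other than filter_mut are illustrative.
toy <- data.frame(
  sample = c("s1", "s1", "s2"),
  vaf    = c(0.002, 0.48, 0.004),
  stringsAsFactors = FALSE
)

# Flag putative germline variants: vaf larger than the cutoff
vaf_cutoff <- 0.01
toy$filter_mut <- toy$vaf > vaf_cutoff

# Shape returned when return_filtered_rows = TRUE:
# a list holding the flagged data and the flagged/removed records
result <- list(
  mutation_data = toy,
  filtered_rows = toy[toy$filter_mut, , drop = FALSE]
)

names(result)
nrow(result$filtered_rows)
```

With return_filtered_rows = FALSE, only the first element (the data frame with the filter_mut column) would be returned.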
Examples
# Example data consists of 24 mouse bone marrow DNA samples imported
# using import_mut_data(). Sequenced on TS Mouse Mutagenesis Panel.
# Example data is retrieved from MutSeqRData, an ExperimentHub data package
if (requireNamespace("MutSeqRData", quietly = TRUE)) {
  library(ExperimentHub)
  eh <- ExperimentHub()
  example_data <- eh[["EH9860"]]
  # In this example, we will apply the following filters:
  # 1) Filter out putative germline variants using a VAF cutoff of 0.01.
  # 2) Remove rows whose position falls outside the intervals of the
  #    TwinStrand Mouse Mutagenesis Panel regions.
  # 3) Apply a custom filter to flag rows with "EndRepairFillInArtifact"
  #    in the column 'filter'. This is a filter step commonly applied to
  #    TwinStrand Duplex Sequencing data.
  # 4) Flag SNVs that overlap with germline MNVs.
  # 5) Subtract the alt_depth of flagged variants from their total_depth
  #    (treat them as No-calls).
  # 6) Return all the flagged/removed rows in a separate data frame.
  filter_example <- filter_mut(
    mutation_data = example_data,
    vaf_cutoff = 0.01,
    regions = "TSpanel_mouse",
    regions_filter = "keep_within",
    custom_filter_col = "filter",
    custom_filter_val = "EndRepairFillInArtifact",
    custom_filter_rm = FALSE, # Flagging, not removing
    snv_in_germ_mnv = TRUE,
    rm_filtered_mut_from_depth = TRUE,
    return_filtered_rows = TRUE
  )
  # Flagging germline mutations...
  # Found 612 germline mutations.
  # Flagging SNVs overlapping with germline MNVs...
  # Found 20 SNVs overlapping with germline MNVs.
  # Applying custom filter...
  # Flagged 2021 rows with values in <filter> column that matched
  # EndRepairFillInArtifact
  # Applying region filter...
  # Removed 22 rows based on regions.
  # Correcting depth...
  # 909 rows had their total_depth corrected.
  # Removing filtered mutations from the total_depth...
  # Filtering complete.
  # Returning a list: mutation_data and filtered_rows.
  # To separately access the filtered rows and the filtered mutation data:
  filtered_rows <- filter_example$filtered_rows
  filtered_example_mutation_data <- filter_example$mutation_data
}
