This function creates a filter_mut column that will be read by the calculate_mf function and other downstream functions. Variants with
filter_mut == TRUE will be excluded from group mutation
counts. This function may also remove records upon user specification.
Running this function again on the same data will not override the previous
filters. To reset previous filters, set the filter_mut column values to
FALSE.
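As a minimal sketch of the reset described above, the following uses a toy data frame as a stand-in for previously filtered mutation data. The contig and start values are made up for illustration; only the filter_mut column name comes from this page.

```r
# Toy stand-in for mutation data that was already filtered once.
# Column values are illustrative, not taken from the package's example data.
mutation_data <- data.frame(
  contig = c("chr1", "chr1", "chr2"),
  start = c(100, 250, 400),
  filter_mut = c(TRUE, FALSE, TRUE)
)

# Clear every previous flag so that a later call to filter_mut()
# starts from an unflagged data set.
mutation_data$filter_mut <- FALSE
```

Rows that an earlier call removed outright are not restored by this; only the flags are cleared.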
Usage
filter_mut(
mutation_data,
vaf_cutoff = 1,
snv_in_germ_mnv = FALSE,
rm_abnormal_vaf = FALSE,
custom_filter_col = NULL,
custom_filter_val = NULL,
custom_filter_rm = FALSE,
regions = NULL,
regions_filter,
allow_half_overlap = FALSE,
rg_sep = "\t",
is_0_based_rg = TRUE,
rm_filtered_mut_from_depth = FALSE,
return_filtered_rows = FALSE
)
Arguments
- mutation_data
Your mutation data. This can be a data frame or a GRanges object.
- vaf_cutoff
Filter out ostensibly germline variants using a cutoff for variant allele fraction (VAF). Any variant with a vaf larger than the cutoff will be filtered. The default is 1 (no filtering). A value of 0.01 (i.e. 1%) is recommended as a conservative approach to retain only somatic variants.
- snv_in_germ_mnv
A logical value. If TRUE, SNVs that overlap with germline MNVs within the same sample will be filtered out. MNVs are considered germline if their vaf > vaf_cutoff. Default is FALSE.
- rm_abnormal_vaf
A logical value. If TRUE, rows in mutation_data with a variant allele fraction (VAF) between 0.05 and 0.45 or between 0.55 and 0.95 will be removed. We expect variants to have a VAF of ~0, 0.5, or 1, reflecting rare somatic mutations, heterozygous germline mutations, and homozygous germline mutations, respectively. Default is FALSE.
- custom_filter_col
The name of the column in mutation_data to apply a custom filter to. This column will be checked for specific values, as defined by custom_filter_val. If any row in this column contains one of the specified values, that row will either be flagged in the filter_mut column or, if specified by custom_filter_rm, removed from mutation_data.
- custom_filter_val
A set of values used to filter rows in mutation_data based on custom_filter_col. If a row in custom_filter_col matches any value in custom_filter_val, it will either be set to TRUE in the filter_mut column or removed, depending on custom_filter_rm.
- custom_filter_rm
A logical value. If TRUE, rows in custom_filter_col that match any value in custom_filter_val will be removed from mutation_data. If FALSE, filter_mut will be set to TRUE for those rows. Default is FALSE.
- regions
Remove rows that are within/outside of specified regions. regions can be either a file path, a data frame, or a GRanges object containing the genomic ranges by which to filter. File paths will be read using rg_sep. Users can also choose from the built-in TwinStrand Mutagenesis Panels by inputting "TSpanel_human", "TSpanel_mouse", or "TSpanel_rat". Required columns for the regions file are "contig", "start", and "end". In a GRanges object, the required columns are "seqnames", "start", and "end".
- regions_filter
Specifies how the provided regions should be applied to mutation_data. Acceptable values are "remove_within" or "keep_within". If set to "remove_within", records that fall within the specified regions will be removed from mutation_data. If set to "keep_within", only records within the specified regions will be kept in mutation_data, and all other records will be removed.
- allow_half_overlap
A logical value. If TRUE, records that start or end in your regions, but extend outside of them in either direction, will be included in the filter. If FALSE, only records that start and end within the regions will be included in the filter. Default is FALSE.
- rg_sep
The delimiter used when importing regions from a file path. The default is tab-delimited ("\t").
- is_0_based_rg
A logical value. Indicates whether the position coordinates in regions are 0-based (TRUE) or 1-based (FALSE). If TRUE, positions will be converted to 1-based (start + 1). Need not be supplied for TSpanels. Default is TRUE.
- rm_filtered_mut_from_depth
A logical value. If TRUE, the function will subtract the alt_depth of records that were flagged by the filter_mut column from their total_depth. This will treat flagged variants as No-calls. This will not apply to variants flagged as germline by the vaf_cutoff. However, if the germline variant has additional filters applied, then the subtraction will still occur. If FALSE, the alt_depth will be retained in the total_depth for all variants. Default is FALSE.
- return_filtered_rows
A logical value. If TRUE, the function will return both the filtered mutation data and the records that were removed/flagged in a separate data frame. The two data frames will be returned inside a list, with names mutation_data and filtered_rows. Default is FALSE.
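The numeric rules described for vaf_cutoff, rm_abnormal_vaf, and is_0_based_rg can be sketched in base R as follows. This is an illustration of the documented thresholds on made-up values, not the package's implementation. Whether the range boundaries are inclusive is an assumption here, and the vaf vector and regions table are hypothetical.

```r
# Illustrative VAF values; not taken from the package's example data.
vaf <- c(0.001, 0.02, 0.30, 0.50, 0.70, 0.99)

# vaf_cutoff: variants with a VAF larger than the cutoff are treated as germline.
vaf_cutoff <- 0.01
is_germline <- vaf > vaf_cutoff

# rm_abnormal_vaf: a VAF between 0.05 and 0.45, or between 0.55 and 0.95,
# is considered abnormal. Strict inequalities are an assumption of this sketch.
is_abnormal <- (vaf > 0.05 & vaf < 0.45) | (vaf > 0.55 & vaf < 0.95)

# is_0_based_rg = TRUE: a hypothetical 0-based regions table is shifted
# to 1-based coordinates by adding 1 to start, as described above.
regions <- data.frame(
  contig = c("chr1", "chr2"),
  start = c(0, 999),
  end = c(100, 1500)
)
regions$start <- regions$start + 1
```

With a cutoff of 0.01, only the first value (0.001) is retained as somatic. The values 0.30 and 0.70 fall in the abnormal ranges, while 0.50 and 0.99 sit near the expected heterozygous and homozygous germline fractions.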
Examples
# Load example data
example_file <- system.file("extdata", "Example_files",
                            "example_mutation_data.rds",
                            package = "MutSeqR")
example_data <- readRDS(example_file)
# Filter the data
# Basic Usage: Filter out putative germline variants
filter_example_1 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01)
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Filtering complete.
# Remove rows outside of the TwinStrand Mouse Mutagenesis Panel regions
filter_example_2 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01,
                               regions = "TSpanel_mouse",
                               regions_filter = "keep_within")
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Applying region filter...
#> Removed 22 rows based on regions.
#> Filtering complete.
# Apply a custom filter to flag rows with "EndRepairFillInArtifact"
# in the column 'filter'
filter_example_3 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01,
                               regions = "TSpanel_mouse",
                               regions_filter = "keep_within",
                               custom_filter_col = "filter",
                               custom_filter_val = "EndRepairFillInArtifact",
                               custom_filter_rm = FALSE)
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Applying custom filter...
#> Flagged 2021 rows with values in <filter> column that matched EndRepairFillInArtifact
#> Applying region filter...
#> Removed 22 rows based on regions.
#> Filtering complete.
# Flag snv variants that overlap with germline mnv variants.
# Subtract the alt_depth of these variants from their total_depth
# (treat them as No-calls).
# Return all the flagged/removed rows in a separate data frame
filter_example_4 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01,
                               regions = "TSpanel_mouse",
                               regions_filter = "keep_within",
                               custom_filter_col = "filter",
                               custom_filter_val = "EndRepairFillInArtifact",
                               custom_filter_rm = FALSE,
                               snv_in_germ_mnv = TRUE,
                               rm_filtered_mut_from_depth = TRUE,
                               return_filtered_rows = TRUE)
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Flagging SNVs overlapping with germline MNVs...
#> Found 20 SNVs overlapping with germline MNVs.
#> Applying custom filter...
#> Flagged 2021 rows with values in <filter> column that matched EndRepairFillInArtifact
#> Applying region filter...
#> Removed 22 rows based on regions.
#> Removing filtered mutations from the total_depth...
#> Filtering complete.
#> Returning a list: mutation_data and filtered_rows.
# Flagging germline mutations...
# Found 612 germline mutations.
# Flagging SNVs overlapping with germline MNVs...
# Found 20 SNVs overlapping with germline MNVs.
# Applying custom filter...
# Flagged 2021 rows with values in <filter> column that matched EndRepairFillInArtifact
# Applying region filter...
# Removed 22 rows based on regions.
# Correcting depth...
# 909 rows had their total_depth corrected.
# Removing filtered mutations from the total_depth...
# Filtering complete.
# Returning a list: mutation_data and filtered_rows.
filtered_rows <- filter_example_4$filtered_rows
filtered_example_mutation_data <- filter_example_4$mutation_data