Filter your mutation data — filter_mut

This function creates a filter_mut column that will be read by the calculate_mf function and other downstream functions. Variants with filter_mut == TRUE will be excluded from group mutation counts. This function may also remove records, depending on user specification. Running this function again on the same data will not override the previous filters. To reset previous filters, set the filter_mut column values to FALSE.
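
The reset described above can be sketched in base R. This is a minimal illustration on a toy data frame, not output from the package; the contig and start columns and their values are assumptions made up for the example, and only the filter_mut column name comes from the description.

```r
# Toy stand-in for mutation data that has already been filtered once.
# Only the filter_mut column name is taken from the documentation;
# the other columns and all values are illustrative assumptions.
mutation_data <- data.frame(
  contig = c("chr1", "chr1", "chr2"),
  start = c(100, 250, 400),
  filter_mut = c(TRUE, FALSE, TRUE)
)

# Flagged rows remain in the data; they are only excluded from counts.
n_flagged <- sum(mutation_data$filter_mut)

# Reset every flag so a new round of filtering starts from a clean slate.
mutation_data$filter_mut <- FALSE
```

After the reset, no row is flagged and no row has been dropped, so the data can be passed to filter_mut again with different settings.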

Usage

filter_mut(
  mutation_data,
  vaf_cutoff = 1,
  snv_in_germ_mnv = FALSE,
  rm_abnormal_vaf = FALSE,
  custom_filter_col = NULL,
  custom_filter_val = NULL,
  custom_filter_rm = FALSE,
  regions = NULL,
  regions_filter,
  allow_half_overlap = FALSE,
  rg_sep = "\t",
  is_0_based_rg = TRUE,
  rm_filtered_mut_from_depth = FALSE,
  return_filtered_rows = FALSE
)

Arguments

mutation_data: Your mutation data. This can be a data frame or a GRanges object.
vaf_cutoff: Filter out ostensibly germline variants using a cutoff for variant allele fraction (VAF). Any variant with a vaf larger than the cutoff will be filtered. The default is 1 (no filtering). It is recommended to use a value of 0.01 (i.e. 1%) as a conservative approach to retain only somatic variants.
snv_in_germ_mnv: Filter out snv variants that overlap with germline mnv variants within the same samples. mnv variants will be considered germline if their vaf > vaf_cutoff. Default is FALSE.
rm_abnormal_vaf: A logical value. If TRUE, rows in mutation_data with a variant allele fraction (VAF) between 0.05 and 0.45 or between 0.55 and 0.95 will be removed. We expect variants to have a VAF of ~0, 0.5, or 1, reflecting rare somatic mutations, heterozygous germline mutations, and homozygous germline mutations, respectively. Default is FALSE.
custom_filter_col: The name of the column in mutation_data to apply a custom filter to. This column will be checked for specific values, as defined by custom_filter_val. If any row in this column contains one of the specified values, that row will either be flagged in the filter_mut column or, if specified by custom_filter_rm, removed from mutation_data.
custom_filter_val: A set of values used to filter rows in mutation_data based on custom_filter_col. If a row in custom_filter_col matches any value in custom_filter_val, it will either be set to TRUE in the filter_mut column or removed, depending on custom_filter_rm.
custom_filter_rm: A logical value. If TRUE, rows in custom_filter_col that match any value in custom_filter_val will be removed from the mutation_data. If FALSE, filter_mut will be set to TRUE for those rows.
regions: Remove rows that are within/outside of specified regions. regions can be either a file path, a data frame, or a GRanges object containing the genomic ranges by which to filter. File paths will be read using rg_sep. Users can also choose from TwinStrand's built-in Mutagenesis Panels by inputting "TSpanel_human", "TSpanel_mouse", or "TSpanel_rat". Required columns for the regions file are "contig", "start", and "end". In a GRanges object, the required columns are "seqnames", "start", and "end".
regions_filter: Specifies how the provided regions should be applied to mutation_data. Acceptable values are "remove_within" or "keep_within". If set to "remove_within", records that fall within the specified regions will be removed from mutation_data. If set to "keep_within", only records within the specified regions will be kept in mutation_data, and all other records will be removed.
allow_half_overlap: A logical value. If TRUE, records that start or end in your regions but extend outside of them in either direction will be included in the filter. If FALSE, only records that both start and end within the regions will be included in the filter. Default is FALSE.
rg_sep: The delimiter for importing the regions file. The default is tab-delimited "\t".
is_0_based_rg: A logical value. Indicates whether the position coordinates in regions are 0-based (TRUE) or 1-based (FALSE). If TRUE, positions will be converted to 1-based (start + 1). Need not be supplied for TSpanels. Default is TRUE.
rm_filtered_mut_from_depth: A logical value. If TRUE, the function will subtract the alt_depth of records that were flagged by the filter_mut column from their total_depth. This will treat flagged variants as No-calls. This will not apply to variants flagged as germline by the vaf_cutoff. However, if the germline variant has additional filters applied, then the subtraction will still occur. If FALSE, the alt_depth will be retained in the total_depth for all variants. Default is FALSE.
return_filtered_rows: A logical value. If TRUE, the function will return both the filtered mutation data and the records that were removed/flagged in a separate data frame. The two data frames will be returned inside a list, with names mutation_data and filtered_rows. Default is FALSE.
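
The VAF and region arguments described above can be pictured with a small base R sketch. It does not call filter_mut and is not the package's implementation. The toy column names, the assumption that all records sit on the same contig as the region, and the strict inequalities at the range boundaries are choices made here for illustration only.

```r
# Toy variants, all assumed to lie on the same contig as the region below.
# Column names and values are illustrative assumptions.
variants <- data.frame(
  start = c(105, 195, 300),
  end   = c(105, 205, 300),
  vaf   = c(0.002, 0.50, 0.30)
)

# vaf_cutoff: a VAF larger than the cutoff marks a variant as germline.
vaf_cutoff <- 0.01
variants$germline <- variants$vaf > vaf_cutoff

# rm_abnormal_vaf: VAFs between 0.05 and 0.45 or between 0.55 and 0.95.
# Whether the boundaries are inclusive is not stated; strict is assumed.
variants$abnormal_vaf <-
  (variants$vaf > 0.05 & variants$vaf < 0.45) |
  (variants$vaf > 0.55 & variants$vaf < 0.95)

# is_0_based_rg = TRUE: convert the region start to 1-based (start + 1).
region <- data.frame(start = 99, end = 200)
region$start <- region$start + 1

# Does each record start and/or end inside the region?
starts_in <- variants$start >= region$start & variants$start <= region$end
ends_in   <- variants$end >= region$start & variants$end <= region$end

# allow_half_overlap = FALSE: the record must start AND end in the region.
variants$within_strict <- starts_in & ends_in

# allow_half_overlap = TRUE: starting OR ending in the region is enough.
variants$within_half <- starts_in | ends_in
```

In this toy data the second record starts inside the region but ends outside it, so it counts as within the region only when half overlaps are allowed.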

Examples

# Load example data
example_file <- system.file("extdata", "Example_files",
                            "example_mutation_data.rds",
                            package = "MutSeqR")
example_data <- readRDS(example_file)
# Filter the data
# Basic Usage: Filter out putative germline variants
filter_example_1 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01)
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Filtering complete.
# Remove rows outside of the TwinStrand Mouse Mutagenesis Panel regions
filter_example_2 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01,
                               regions = "TSpanel_mouse",
                               regions_filter = "keep_within")
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Applying region filter...
#> Removed 22 rows based on regions.
#> Filtering complete.
# Apply a custom filter to flag rows with "EndRepairFillInArtifact"
# in the column 'filter'
filter_example_3 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01,
                               regions = "TSpanel_mouse",
                               regions_filter = "keep_within",
                               custom_filter_col = "filter",
                               custom_filter_val = "EndRepairFillInArtifact",
                               custom_filter_rm = FALSE)
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Applying custom filter...
#> Flagged 2021 rows with values in <filter> column that matched EndRepairFillInArtifact
#> Applying region filter...
#> Removed 22 rows based on regions.
#> Filtering complete.
# Flag snv variants that overlap with germline mnv variants.
# Subtract the alt_depth of these variants from their total_depth
# (treat them as No-calls).
# Return all the flagged/removed rows in a separate data frame
filter_example_4 <- filter_mut(mutation_data = example_data,
                               vaf_cutoff = 0.01,
                               regions = "TSpanel_mouse",
                               regions_filter = "keep_within",
                               custom_filter_col = "filter",
                               custom_filter_val = "EndRepairFillInArtifact",
                               custom_filter_rm = FALSE,
                               snv_in_germ_mnv = TRUE,
                               rm_filtered_mut_from_depth = TRUE,
                               return_filtered_rows = TRUE)
#> Flagging germline mutations...
#> Found 612 germline mutations.
#> Flagging SNVs overlapping with germline MNVs...
#> Found 20 SNVs overlapping with germline MNVs.
#> Applying custom filter...
#> Flagged 2021 rows with values in <filter> column that matched EndRepairFillInArtifact
#> Applying region filter...
#> Removed 22 rows based on regions.
#> Removing filtered mutations from the total_depth...
#> Filtering complete.
#> Returning a list: mutation_data and filtered_rows.
# Flagging germline mutations...
# Found 612 germline mutations.
# Flagging SNVs overlapping with germline MNVs...
# Found 20 SNVs overlapping with germline MNVs.
# Applying custom filter...
# Flagged 2021 rows with values in <filter> column that matched EndRepairFillInArtifact
# Applying region filter...
# Removed 22 rows based on regions.
# Correcting depth...
# 909 rows had their total_depth corrected.
# Removing filtered mutations from the total_depth...
# Filtering complete.
# Returning a list: mutation_data and filtered_rows.
filtered_rows <- filter_example_4$filtered_rows
filtered_example_mutation_data <- filter_example_4$mutation_data
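
The depth adjustment requested by rm_filtered_mut_from_depth in the last example can be pictured with a toy base R sketch. This is an illustration of the rule as described in the Arguments section, not the package's code. The germline_only column is a hypothetical helper marking records whose only flag comes from vaf_cutoff, and all values are made up.

```r
# Toy records. alt_depth, total_depth and filter_mut follow the names used
# in the argument descriptions; germline_only is a hypothetical helper
# column for records flagged solely by the vaf_cutoff.
toy <- data.frame(
  alt_depth     = c(1, 2, 40),
  total_depth   = c(1000, 800, 80),
  filter_mut    = c(FALSE, TRUE, TRUE),
  germline_only = c(FALSE, FALSE, TRUE)
)

# Flagged records are treated as no-calls, except those flagged only as
# germline, which keep their alt_depth in the total_depth.
subtract <- toy$filter_mut & !toy$germline_only

# Remove the alt_depth of those records from their total_depth.
toy$total_depth[subtract] <-
  toy$total_depth[subtract] - toy$alt_depth[subtract]
```

Only the second toy record changes: it is flagged by something other than the germline cutoff, so its alt_depth of 2 is removed from its total_depth.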