Imports tabular mutation file into the local R environment.
Usage
import_mut_data(
mut_file,
mut_sep = "\t",
is_0_based_mut = TRUE,
sample_data = NULL,
sd_sep = "\t",
regions = NULL,
rg_sep = "\t",
is_0_based_rg = TRUE,
padding = 0,
genome = NULL,
species = NULL,
masked_BS_genome = FALSE,
custom_column_names = NULL,
output_granges = FALSE
)
Arguments
- mut_file
The mutation data file(s) to be imported. This can be either a data frame object or a filepath to a file or directory. If you specify a directory, the function will attempt to read all files in the directory and combine them into a single data frame. Mutation data should consist of a row for each variant. Required columns are listed in details.
- mut_sep
The delimiter for importing the mutation file. Default is tab-delimited.
- is_0_based_mut
A logical variable. Indicates whether the position coordinates in the mutation data are 0 based (TRUE) or 1 based (FALSE). If TRUE, positions will be converted to 1-based.
- sample_data
An optional file containing additional sample metadata (dose, timepoint, etc.). This can be a data frame or a file path. Metadata will be joined with the mutation data based on the sample column. Required columns are
sample
and any additional columns you wish to include.- sd_sep
The delimiter for importing sample data. Default is tab-delimited.
- regions
An optional file containing metadata of genomic regions. Region metadata will be joined with mutation data and variants will be checked for overlap with the regions.
regions
can be either a file path, a data frame, or a GRanges object. File paths will be read using the rg_sep. Users can also choose from the built-in TwinStrand's Mutagenesis Panels by inputting "TSpanel_human", "TSpanel_mouse", or "TSpanel_rat". Required columns for the regions file are "contig", "start", and "end". For a GRanges object, the required columns are "seqnames", "start", and "end". Default is NULL.- rg_sep
The delimiter for importing the custom_regions. The default is tab-delimited "\t".
- is_0_based_rg
A logical variable. Indicates whether the position coordinates in
regions
are 0 based (TRUE) or 1 based (FALSE). If TRUE, positions will be converted to 1-based (start + 1). Need not be supplied for TSpanels. Default is TRUE.- padding
An integer >= 0. Extend the range of your regions in both directions by the given amount. Ex. Structural variants and indels may start outside of the regions. Adjust the padding to include these variants in your region's ranges.
- genome
The genome assembly version of the reference genome. This is required if your data does not include a context column. The function will install a BS genome for the given species/genome/masked arguments to populate the context column. Ex.Human GRCh38 = hg38 | Human GRCh37 = hg19 | Mouse GRCm38 = mm10 | Mouse GRCm39 = mm39 | Rat RGSC 6.0 = rn6 | Rat mRatBN7.2 = rn7
- species
The species. Required if your data does not include a context column. The function will install a BS genome for the given species/genome/masked to populate the context column. The species can be the common name of the species or the scientific name. Ex. "human" or "Homo sapiens".
- masked_BS_genome
A logical value. Required when using a BS genome to poulate the context column. Whether to use the masked version of the BS genome (TRUE) or not (FALSE). Default is FALSE.
- custom_column_names
A list of names to specify the meaning of column headers. Since column names can vary with data, this might be necessary to digest the mutation data properly. Typical defaults are set, but can be substituted in the form of
list(my_custom_contig_name = "contig", my_custom_sample_column_name = "sample")
. You can change one or more of these. Set column synonyms are defined in MutSeqR::op$column and will automatically be changed to their default value.- output_granges
A logical variable; whether you want the mutation data to output as a GRanges object. Default output (FALSE) is as a dataframe.
Value
A table where each row is a mutation, and columns indicate the
location, type, and other data. If output_granges
is set to TRUE, the
mutation data will be returned as a GRanges object, otherwise mutation
data is returned as a dataframe.
Output Column Definitions:
short_ref
: The reference base at the start position.normalized_ref
: The short_ref in C/T-base notation for this position (e.g. A -> T, G -> C).context
The trinucleotide context at this position. Consists of the reference base and the two flanking bases (e.g. TAC).normalized_context
: The trinucleotide context in C/T base notation for this position (e.g. TAG -> CTA).variation_type
The type of variant (snv, mnv, insertion, deletion, complex, sv, no_variant, ambiguous, uncategorized).subtype
The substitution type for the snv variant (12-base spectrum; e.g. A>C).normalized_subtype
The C/T-based substitution type for the snv variant (6-base spectrum; e.g. A>C -> T>G).context_with_mutation
: The substitution type for the snv variant including the two flanking nucleotides (192-trinucleotide spectrum; e.g.T[A>C]G
)normalized_context_with_mutation
: The C/T-based substitution type for the snv variant including the two flanking nucleotides (96-base spectrum e.g.T[A>C]G
->C[T>G]A
).nchar_ref
: The length (in bp) of the reference allele.nchar_alt
: The length (in bp) of the alternate allele.varlen
: The length (in bp) of the variant.ref_depth
: The depth of the reference allele. Calculated astotal_depth
-alt_depth
, if applicable.vaf
: The variant allele fraction. Calculated asalt_depth
/total_depth
.gc_content
: % GC of the trinucleotide context at this position.is_known
: TRUE or FALSE. Flags known variants (ID != ".").row_has_duplicate
: TRUE or FALSE. Flags rows whose position is the same as that of at least one other row for the same sample.filter_mut
: A logical value, initially set to FALSE that indicates to calculte_mf() if the variant should be excluded from mutation counts. See the filter_mut function for more detail.
Details
Required columns for mut files are:
contig
: The name of the reference sequence.start
: The start position of the feature.end
: The half-open end position of the feature.sample
: The sample name.ref
: The reference allele at this positionalt
: The left-aligned, normalized, alternate allele at this position. Multiple alt alleles called for a single position should be represented as separate rows in the table.
The following columns are not required, but are recommended for full package functionality:
alt_depth
: The read depth supporting the alternate allele. If not included, the function will add this column, assuming a value of 1.total_depth
: The total read depth at this position, excluding no-calls (N calls). If not present, the function will attempt to calculate thetotal_depth
asdepth
-no_calls
. If no_calls is not present, the function will usedepth
as thetotal_depth.
depth
: The total read depth at this position, including no-calls.no_calls
: The number of no-calls (N-calls) at this position.
We recommend that files include a record for every sequenced
position, regardless of whether a variant was called, along with the
total_depth
for each record. This enables site-specific depth calculations
required for some downstream analyses.
Examples
# Example: Import a single mutation file. This library was sequenced with
# Duplex Sequencing using the TwinStrand Mouse Mutagenesis Panel which
# consists of 20 2.4kb targets = 48kb of sequence.
example_file <- system.file("extdata", "Example_files",
"example_import_mut_data.rds",
package = "MutSeqR")
example_data <- readRDS(example_file)
# We will create an example metadata table for this data.
sample_meta <- data.frame(sample = "dna00996.1",
dose = "50",
dose_group = "High")
# Import the data
imported_example_data <- import_mut_data(mut_file = example_data,
sample_data = sample_meta,
regions = "TSpanel_mouse",
genome = "mm10",
species = "mouse",
masked_BS_genome = FALSE)
#> Joining with `by = join_by(sample)`
#> 'getOption("repos")' replaces Bioconductor standard repositories, see
#> 'help("repositories", package = "BiocManager")' for details.
#> Replacement repositories:
#> CRAN: https://cran.rstudio.com
#> Reference genome already installed.
#> Loading reference genome: BSgenome.Mmusculus.UCSC.mm10.
#> Retrieving context sequences from BSgenome
#> Warning: 150 rows were found whose position was the same as that of at least one other row for the same sample.
#> Warning: The total_depth may be double-counted in some instances due to overlapping positions. Set the correct_depth parameter in calculate_mf() to correct the total_depth for these instances.