Skip to contents

Introduction

What is ECS?

Error-corrected next-generation sequencing (ECS) uses various methods to combine multiple independent raw sequence reads derived from an original starting molecule, thereby subtracting out artifacts introduced during sequencing or library preparation. This results in a highly accurate representation of the original molecule. ECS is particularly useful for detecting rare somatic mutations (or mutations induced in germ cells), such as those that arise from mutagen exposure or other sources of DNA damage. ECS is a powerful tool for assessing the mutagenicity of chemicals, drugs, or other agents, and can be used to identify the mutational signatures of these agents. ECS can also be used to detect rare mutations in cancer or other diseases, and to track the clonal evolution of these diseases over time.

For more background on how ECS works and its context in regulatory toxicology testing and genetic toxicology, see the following articles:

  • Menon and Brash (2023)
  • Marchetti, Cardoso, Chen, Douglas, Elloway, Escobar, Harper, et al. (2023)
  • Marchetti, Cardoso, Chen, Douglas, Elloway, Escobar, Harper Jr, et al. (2023)
  • Kennedy et al. (2014)

This R package is meant to facilitate the import, cleaning, and analysis of ECS data, beginning with a table of variant calls or a variant call file (VCF). The package is designed to be flexible and enable users to perform common statistical analyses and visualisations.

Installation

Install the package from github:

install.packages("devtools")

devtools::install_github(
  "EHSRB-BSRSE-Bioinformatics/MutSeqR",
  auth_token = "your personal_access_token from GitHub"
)

Install the package from Bioconductor (currently not implemented):

if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("MutSeqR")

Load the package

Data import

The main goal of MutSeqR is to generate summary statistics, visualizations, exploratory analyses, and other post-processing tasks such as mutational signature analysis or generalized linear modeling. Mutation data should be supplied as a table of variants with their genomic positions. Mutation data can be imported as either VCF files or as tabular data using the functions import_vcf_data and import_mut_data, respectively. It is reccomended that files include a record for every sequenced position, regardless of whether a variant was called or not, along with the total_depth for each record. This enables site-specific depth calculations that are required for the calculation of mutation subtype frequencies ad other site-specific frequemcies. The data set can be pared down later to include only mutations of interest (SNVs, indels, SVs, or any combination).

(#tab:required-columns) Required columns for mutation data import.
Column VCF Specification Definition
contig CHROM The name of the reference sequence.
start POS The start position of the feature. 0-based coordinates are accepted but will be changed to 1-based during import.
end INFO(END) The half-open end position of the feature in contig.
ref REF The reference allele at this position.
alt ALT The left-aligned, normalized, alternate allele at this position.
sample INFO(sample) or Genotype Header A unique identifier for the sample library. For VCF files, this field may be provided in either the INFO field, or as the header to the GENOTYPE field.
SUGGESTED FIELDS
alt_depth VD The read depth supporting the alternate allele. If not included, the function will assume an alt_depth of 1 at variant sites.
total_depth or depth AD or DP The total read depth at this position. This column can be “total_depth” which excludes N-calls, or “depth”, which includes N-calls, if “total_depth” is not available. For VCF files, the total_depth is calculated as the sum of AD. DP is equivalent to “depth”.

VCF files should follow the VCF specification (version 4.5; Danecek et al. (2011)). VCF files may be bg/g-zipped. Each individual VCF file should contain the mutation data for a single sample library. Multiple alt alleles called for a single position should be represented as separate rows in the data. All extra columns, INFO fields, and FORMAT fields will be retained upon import.

Upon import, records are categorized within the variation_type column based on their REF and ALT. Categories are listed below.

(#tab:variation-types) Definitions of Variation types.
variation_type Definition
no_variant No variation, the null-case.
snv Single nucleotide variant.
mnv Multiple nucleotide variant.
insertion Insertion.
deletion Deletion.
complex REF and ALT are of different lengths and nucleotide compositions.
symbolic Structural variant
ambiguous ALT contains IUPAC ambiguity codes.
uncategorized The record does not fall into any of the preceding categories.

Additional columns are created to further characterise variants.

(#tab:mut-columns) Definitions of Mutation data columns.
Column Name Definition
short_ref The reference base at the start position.
normalized_ref The short_ref in C/T (pyrimidine) notation for this position. Ex. A -> T, G -> C
context The trinucleotide context at this position. Consists of the reference base and the two flanking bases. Sequences are retrieved from the appropriate BS genome. Ex. TAC
normalized_context The trinucleotide context in C/T (pyrimidine) notation for this position (Ex. TAG -> CTA)
variation_type The type of variant (no_variant, snv, mnv, insertion, deletion, complex, sv, ambiguous, uncategorized)
subtype The substitution type of the snv variant (12-base spectrum; Ex. A>C)
normalized_subtype The snv subtype in C/T (pyrimidine) notation (6-base spectrum; Ex. A>C -> T>G)
context_with_mutation The snv subtype including the two flanking nucleotides (192-base spectrum; Ex. T[A>C]G)
normalized_context_with_mutation The snv subtype in C/T (pyrimidine) notation including the two flanking nucleotides (96-base spectrum; Ex. T[A>C]G -> C[T>G]A)
nchar_ref The length (in bp) of the reference allele.
nchar_alt The length (in bp) of the alternate allele.
varlen The length (in bp) of the variant.
ref_depth The depth of the reference allele. Calculated as total_depth - alt_depth, if applicable.
vaf The variant allele fraction. Calculated as alt_depth/total_depth
gc_content % GC of the trinucleotide context at this position.
is_known A logical value indicating if the record is a known variant; i.e. ID field is not NULL.
row_has_duplicate A logical value that flags rows whose position is the same as that of at least one other row for the same sample.

General Usage

Indicate the filepath to your mutation data file(s) using the vcf_file/ mut_file parameter. This can be either a single file or a directory containing multiple files. Should you provide a directory, then all files within will be bound into a single data frame.

An important component of importing your data for proper use is to assign each mutation to a biological sample, and also make sure that some additional information about each sample is present (ex., a chemical treatment, a dose, etc.). This is done by providing sample_data. This parameter can take a data frame, or it can read in a file if provided with a filepath. If using a filepath, specify the proper delimiter using the sd_sep parameter. Sample metadata will be joined with the mutation data using the “sample” column to capture that information and associate it with each mutation.

Specify the appropriate BS genome with which to populate the context column by supplying the species, genome, and masked_BS_genome parameters. The function will browse BSgenome::available.genomes for the appropriate reference genome and install the corresponding package. Context information will be extracted from the installed BSgenome object. BSgenome offers genomes with masked sequences. If you wish to use the masked version of the genome, set masked_BS_genome to TRUE.

Examples

We provide an example data set taken from LeBlanc et al. (2022). This data consists of 24 mouse bone marrow samples sequenced with Duplex Sequencing using Twinstrand’s Mouse Mutagenesis Panel of twenty 2.4kb targeted genomic loci. Mice were exposed to three doses of benzo[a]pyrene (BaP) alongside vehicle controls, n = 6.

VCF

Example 1.1. Import the example .vcf.bgz file. Provided is the genomic vcf.gz file for sample dna00996.1. It is comprised of a record for all 48K positions sequenced for the Mouse Mutagenesis Panel with the alt_depth and the tota_depth values for each record.

example_file <- system.file(
  "extdata",
  "Example_files",
  "example_import_vcf_data_cleaned.vcf.bgz",
  package = "MutSeqR"
)
sample_metadata <- data.frame(
  sample = "dna00996.1",
  dose = "50",
  dose_group = "High"
)
# Import the data
imported_example_data <- import_vcf_data(
  vcf_file = example_file,
  sample_data = sample_metadata,
  genome = "mm10",
  species = "mouse",
  masked_BS_genome = FALSE
)

Data Table 1. Mutation data imported from VCF for sample dna00996.1. Displays the first 6 rows.

Tabular

Example 1.2. Import example tabular data. This is the equivalent file to the example vcf file. It is stored as an .rds file. We will load the dataframe and supply it to import_mut_data. The mut_file parameter can accept file paths or data frames as input.

example_file <- system.file(
  "extdata",
  "Example_files",
  "example_import_mut_data.rds",
  package = "MutSeqR"
)
example_data <- readRDS(example_file)

sample_metadata <- data.frame(
  sample = "dna00996.1",
  dose = "50",
  dose_group = "High"
)
# Import the data
imported_example_data <- import_mut_data(
  mut_file = example_data,
  sample_data = sample_metadata,
  genome = "mm10",
  species = "mouse",
  masked_BS_genome = FALSE,
  is_0_based_mut = TRUE # indicates that the genomic coordinates are 0-based.
# Coordinates will be changed to 1-based upon import.
)

Data Table 2. Mutation data imported from a data.frame object for sample dna00996.1. Displays the first 6 rows.

Region Metadata

Similar to sample metadata, you may supply a file containing the metadata of genomic regions to the regions parameter. In the file, the genomic regions are defined by the name of their reference sequence plus their start and end coordinates. Additional columns are added to describe features of the genomic regions (Ex. transcription status, GC content, etc.). Region metadata will be joined with mutation data by checking for overlap between the target region ranges and the position of the record. The regions parameter can be either a filepath, a data frame, or a GRanges object (see GenomicRanges). Filepaths will be read using the rg_sep. Users can also choose from the built-in TwinStrand DuplexSeq™ Mutagenesis Panels by inputting “TSpanel_human”, “TSpanel_mouse”, or “TSpanel_rat”. Required columns for the regions file are “contig”, “start”, and “end”. In a GRanges object, the required columns are “seqnames”, “start”, and “end”. Users must indicate whether the region coordinates are 0-based or 1-based with is_0_based_rg. Mutation data and region coordinates will be converted to 1-based. If you do not wish to specify regions, then set the regions parameter to NULL (default).

Example 1.3. Add the metadata for TwinStrand’s Mouse Mutagenesis panel to our example tabular file.

imported_example_data <- import_mut_data(
  mut_file = example_data,
  sample_data = sample_metadata,
  genome = "mm10",
  species = "mouse",
  masked_BS_genome = FALSE,
  is_0_based_mut = TRUE,
  regions = "TSpanel_mouse"
)

Data Table 3. Mutation data imported from a data.frame object for sample dna00996.1 with added metadata for the Mouse Mutagenesis Panels sequencing targets. Displays the first 6 rows. Added columns are ‘target_size’, ‘label’, ‘genic_context’, ‘region_GC_content’, ‘genome’, and ‘in_regions’.

Users can load the example TwinStrand Mutagenesis Panels with load_regions(). This function will output a GRanges object. Custom panels may also be loaded with this function by providing their file path to the regions parameter.

region_example <- load_regions_file("TSpanel_mouse")
region_example
## GRanges object with 20 ranges and 5 metadata columns:
##        seqnames              ranges strand | target_size       label
##           <Rle>           <IRanges>  <Rle> |   <integer> <character>
##    [1]     chr1   69304218-69306617      * |        2400        chr1
##    [2]     chr1 155235939-155238338      * |        2400      chr1.2
##    [3]     chr2   50833176-50835575      * |        2400        chr2
##    [4]     chr3 109633161-109635560      * |        2400        chr3
##    [5]     chr4   96825281-96827680      * |        2400        chr4
##    ...      ...                 ...    ... .         ...         ...
##   [16]    chr15   66779763-66782162      * |        2400       chr15
##   [17]    chr16   72381581-72383980      * |        2400       chr16
##   [18]    chr17   94009029-94011428      * |        2400       chr17
##   [19]    chr18   81262079-81264478      * |        2400       chr18
##   [20]    chr19     4618814-4621213      * |        2400       chr19
##        genic_context region_GC_content      genome
##          <character>         <numeric> <character>
##    [1]    intergenic              37.3        mm10
##    [2]         genic              54.0        mm10
##    [3]    intergenic              45.3        mm10
##    [4]         genic              39.2        mm10
##    [5]    intergenic              39.4        mm10
##    ...           ...               ...         ...
##   [16]         genic              44.0        mm10
##   [17]    intergenic              38.3        mm10
##   [18]    intergenic              35.2        mm10
##   [19]    intergenic              47.3        mm10
##   [20]         genic              56.1        mm10
##   -------
##   seqinfo: 19 sequences from an unspecified genome; no seqlengths

Custom Column Names

We recognize that column names may differ from those specified in Table @ref(tab:required-columns). Therefore, we have implemented some default column name synonyms. If your column name matches one of our listed synonyms, it will automatically be changed to match our set values. For example, your contig column may be named chr or chromosome. After importing your data, this synonymous column name will be changed to contig. Column names are case-insensitive. A list of column name synonyms are listed below.

(#tab:name-sym) Predefined column name synonyms. Synonyms will be automatically changed to the default Column value upon import.
Column Synonyms
alt alternate
alt_depth alt_read_depth, var_depth, variant_depth, vd
context sequence_context, trinucleotide_context, flanking_sequence
contig chr, chromosome, seqnames
depth dp
end end_position
no_calls n_calls, n_depth, no_depth
ref reference, ref_value
sample sample_name, sample_id
start pos, position
total_depth informative_somatic_depth

If your data contains a column that is synonymous to one of the required columns, but the name is not included in our synonyms list, your column name may be substituted using the custom_column_names parameter. Provide this parameter with a list of names to specify the meaning of column headers.

mut_data <- system.file(
  "extdata", "Example_files",
  "example_import_mut_data_custom_col_names.txt",
  package = "MutSeqR"
)
imported_example_data_custom <- import_mut_data(
  mut_file = mut_data,
  custom_column_names = list(my_contig_name = "contig",
                             my_sample_name = "sample")
)
## Expected 'contig' but found 'my_contig_name', renaming it.
## Expected 'sample' but found 'my_sample_name', renaming it.

Data Table 4. Mutation data imported from tabular file with custom column names. Displays the first 6 rows. Custom column names are changed to default values.

Output

Mutation data can be output as either a data frame or a GRanges object (see GenomicRanges) for downstream analysis. Use the output_granges parameter to specify the output. GRanges may faciliate use in other packages and makes doing genomic based analyses on the ranges significantly easier. Most downstream analyses provided by MutSeqR will use a data frame.

Variant Filtering

Following data import, the mutation data may be filtered using the filter_mut() function. This function will flag variants based on various parameters in the filter_mut column. Variants flagged as TRUE in the filter_mut column will be automatically excluded from downstream analyses such as calculate_mf(), plot_bubbles, and signature_fitting. When specified, this function may also remove records from the mutation data.

Flagging the variants in the filter_mut column will not remove them from the mutation data, however, they will be excluded from mutation counts in all downstream analyses. Filtered variants are retained in the data so that their total_depth values may still be used in frequency calculations.

By default, all filtering parameters are disabled. Users should be mindful of the filters that they use, ensuring first that they are applicable to their data.

Germline Variants

Germline variants may be identified and flagged for filtering by setting the vaf_cutoff parameter. The variant allele fraction (vaf) is the fraction of haploid genomes in the original sample that harbor a specific mutation at a specific base-pair coordinate of the reference genome. Specifically, is it calculated by dividing the number of variant reads by the total sequencing depth at a specific base pair coordinate (alt_depth / total_depth). In a typical diploid cell, a homozygous germline variant will appear on both alleles, in every cell. As such, we expect this variant to occur on every read, giving us a vaf = 1. A heterozygous germline variant occurs on one of the two alleles in every cell, as such we expect this variant to occur on about half of the reads, giving a vaf = 0.5. Somatic variants occur in only a small portion of the cells, thus we expect them to appear in only a small percentage of the reads. Typical vaf values for somatic variants are less than 0.01 - 0.1. Setting the vaf_cutoff parameter will flag all variants that have a vaf greater than this value as germline within the is_germline column. It will also flag these variants in the filter_mut so as to exclude them from downstream analyses.

vaf germline filtering is only applicable to users whose data was sequenced to a sufficient depth. High coverage/low depth sequencing cannot be used to calculate the vaf, thus it is recommended to filter germline mutations by contrasting against a database of known polymorphisms, or by using conventional whole-genome sequencing to identify germline variants for each sample.

Quality Assurance

The filter_mut() function offers some filtering options to ensure the quality of the mutation data.

snv_in_germ_mnv = TRUE will flag all snvs that overlap with germline mnvs. Germline mnvs are defined as having a vaf > vaf_cutoff. These snvs are often artifacts from variant calling. No-calls in reads supporting the germline mnv will create false minor-haplotypes from the original mnv that can appear as sub-clonal snvs, thus such variants are excluded from downstream analyses.

rm_abnormal_vaf = TRUE This parameter identifies rows with abnormal vaf values and removes them from the mutation data. Abnormal vaf values are defined as being between 0.05-0.45 or between 0.55-0.95. In a typical diploid organism, we expect variants to have a vaf ~0, 0.5, or 1, reflecting rare somatic variants, heterozygous variants, and homozygous variants respectively. Users should be aware of the ploidy of their model system when using this filter. Non-diploid organisms may exhibit different vafs.

rm_filtered_mut_from_depth = TRUE Variants flagged in the filter_mut column will have their alt_depth subtracted from the total_depth. When TRUE, this parameter treats flagged variants as No-calls. This does not apply to variants that were idenfied as germline variants.

Custom Filtering

Users may use the filter_mut() function to flag or remove variants based on their own custom column. Any record that contains the custom_filter_val value within the custom_filter_col column of the mutation data will be either flagged in the filter_mut column or, if specified by the custom_filter_rm parameter, removed from the mutation data.

Filtering by Regions

Users may remove rows that are either within or outside of specified genomic regions. Provide the region ranges to the regions parameter. This may be provided as either a filepath, data frame, or a GRanges object. regions must contain contig (or seqnames for GRanges), start, and end. The function will check whether each record falls within the given regions. Users can define how this filter should be used with regions_filter. region_filter = "remove_within" will remove all rows whose positions overlap with the provided regions. region_filter = "keep_within" will remove all rows whose positions are outside of the provided regions. By default, records that are > 1bp must start and end within the regions to count as being within the region. allow_half_overlap = TRUE allow records that only start or end within the regions but extend outside of them to be counted as being within the region. Twinstrand Mutagenesis Panels may be used by setting regions to one of “TSpanel_human”, “TSpanel_mouse”, or “TSpanel_rat”.

Output

The function will return the mutation data as a data.frame object with rows removed or flagged depending on input parameters. Added columns are ‘filter_mut’, ‘filter_reason’, ‘is_germline’, and ‘snv_mnv_overlaps’. If return_filtered_rows = TRUE The function will return both the data.frame containing the processed mutation data and a data.frame containing the rows that were removed/flagged during the filtering process. The two dataframes will be returned inside of a list, with names mutation_data and filtered_rows. The function will also print the number of mutations/rows that were filtered according to each parameter.

Example

Example 2. The example file “example_mutation_data.rds” is the output of import_mut_data() run on all 24 mouse libraries from LeBlanc et al. (2022).

Filters used:

  • Filter germline variants: vaf < 0.01
  • Filter snvs overlapping with germline variants and have their alt_depth removed from their total_depth.
  • Remove records outside of the TwinStrand Mouse Mutagenesis Panel.
  • Filter variants that contain “EndRepairFillInArtifact” in the “filter” column. Their alt_depth will be removed from their total_depth.
# load the example data
example_file <- system.file(
  "extdata", "Example_files",
  "example_mutation_data.rds",
  package = "MutSeqR"
)
example_data <- readRDS(example_file)

# Filter
filtered_example_mutation_data <- filter_mut(
  mutation_data = example_data,
  vaf_cutoff = 0.01,
  regions = "TSpanel_mouse",
  regions_filter = "keep_within",
  custom_filter_col = "filter",
  custom_filter_val = "EndRepairFillInArtifact",
  custom_filter_rm = FALSE,
  snv_in_germ_mnv = TRUE,
  rm_filtered_mut_from_depth = TRUE,
  return_filtered_rows = FALSE
)
## Flagging germline mutations...
## Found 612 germline mutations.
## Flagging SNVs overlapping with germline MNVs...
## Found 20 SNVs overlapping with germline MNVs.
## Applying custom filter...
## Flagged 2021 rows with values in <filter> column that matched EndRepairFillInArtifact
## Applying region filter...
## Removed 22 rows based on regions.
## Removing filtered mutations from the total_depth...
## Filtering complete.

Data Table 5. Filtered Mutation data for all 24 samples. Displayed are 6 variants that were flagged during filtering.

Calculating MF

Mutation Frequency (MF) is calculated by dividing the sum of mutations by the sum of the total_depth for a given group (units mutations/bp). The function calculate_mf() sums mutation counts and total_depth across user-defined groupings within the mutation data to calculate the MF. Mutations can be summarised across samples, experimental groups, and mutation subtypes for later statistical analyses. Variants flagged as TRUE in the filter_mut column will be excluded from the mutation sums; however, the total_depth of these variants will be counted in the total_depth sum.

If the total_depth is not available, the function will sum the mutations across groups, without calculating the mutation frequencies.

Mutation Counting

Mutations are counted based on two opposing assumptions (Dodge et al. 2023):

The Minimum Independent Mutation Counting Method (min) Each mutation is counted once, regardless of the number of reads that contain the non-reference allele. This method assumes that multiple instances of the same mutation within a sample are the result of clonal expansion of a single mutational event. When summing mutations across groups using the Min method (sum_min), the alt_depth of each variant is set to 1. Ex. For 3 variants with alt_depth values of 1, 2, and 10, the sum_min= 3.

The Maximum Independent Mutation Counting Method (max) Multiple identical mutations at the same position within a sample are counted as independent mutation events. When summing mutations across groups using the Max method (sum_max), the alt_depth of each variant is summed unchanged. Ex. for 3 variants with alt_depth values of 1, 2, and 10, the sum_max = 13.

The Min and Max mutation counting methods undercount and overcount the mutations, respectively. We expect some recurrent mutations to be a result of clonal expansion. We also expect some recurrent mutations to arise independently of each other. As we expect the true number of independent mutations to be somewhere in the middle of the two counting methods, we calculate frequencies for both methods. However, the Min mutation counting method is generally recommended for statistical analyses to ensure independance of values and because the Max method tends to increase the sample variance by a significant degree.

Grouping Mutations

Mutation counts and total_depth are summed across groups that can be designated using the cols_to_group parameter. This parameter can be set to one or more columns in the mutation data that represent experimental variables of interest.

Example 3.1. Calculate mutation sums and frequencies per sample. The file “example_mutation_data_filtered.rds” is the output of filter_mut() from Example 2

# load example data:
example_file <- system.file(
  "extdata", "Example_files",
  "example_mutation_data_filtered.rds",
  package = "MutSeqR"
)
example_data <- readRDS(example_file)

mf_data_global <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "none",
  retain_metadata_cols = c("dose_group", "dose")
)

Table 6. Global Mutation Frequency per Sample

Alternatively, you can sum mutations by experimental groups such as label (the genomic target). Counts and frequencies will be returned for every level of the designated groups.

Example 3.2. Calculate mutation sums and frequencies per sample and genomic target.

mf_data_rg <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = c("sample", "label"),
  subtype_resolution = "none",
  retain_metadata_cols = "dose_group"
)

Table 7. Mutation Frequency per Sample and genomic target.

calculate_mf() does not calculate the mean MF for any given group If you want to calculate the mean MF for a given experimental variable, you may group by “sample” and retain the experimental variable in the summary table for averaging using the retain_metadata_cols parameter.

Example 3.3. Calculate the mean MF per dose

mf_data_global <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "none",
  retain_metadata_cols = c("dose_group", "dose")
)
mean_mf <- mf_data_global %>%
  dplyr::group_by(dose_group) %>%
  dplyr::summarise(mean_mf_min = mean(mf_min),
                   SE = sd(mf_min) / sqrt(dplyr::n()))

Table 8. Mean Mutation Frequency per Dose Group.

Mutation Subtypes

Mutations can also be grouped by mutation subtype at varying degrees of resolution using the subtype_resolution parameter.

(#tab:subtypes) Definitions of mutation subtype resolutions.
Subtype resolutions Definition Example
type The variation type snv, mnv, insertion, deletion, complex, and symbolic variants
base_6 The simple spectrum; snv subtypes in their pyrimidine context C>A, C>G, C>T, T>A, T>C, T>G
base_12 The snv subtypes A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G
base_96 The trinculeotide spectrum; the snv subtypes in their pyrimidine context alongside their two flanking nucleotides A[C>T]A
base_192 The snv subtypes reported alongside their two flanking nucleotides A[G>A]A

Mutations and total_depth will be summed across groups for each mutation subtype to calculate frequencies. For SNV subtypes, the total_depth is summed based on the sequence context in which the SNV subtype occurs (subtype_depth). In the simplest example, for the base_6 SNV subtypes, the two possible reference bases are C or T; hence, the total_depth is summed separately for C:G positions and T:A positions. Thus, the MF for C>T mutations is calculated as the total number of C>T mutations divided by total_depth for C:G positions within the group: sum / subtype_depth. Non-snv mutation subtypes, such as mnvs, insertions, deletions, complex variants, and structural variants, will be calculated as their sum / group_depth, since they can occur in the context of any nucleotide.
Upon import of mutation data, columns are created that facilitate the grouping of SNV subtypes and their associated sequence context for the various resolutions. Below the columns associated with each subtype_resolution are defined:

(#tab:subtypes-cols) Subtype Resolutions and their associated subtype/context columns.
Subtype resolution Subtype Column Context Column
type variation_type NA
base_6 normalized_subtype normalized_ref
base_12 subtype short_ref
base_96 normalized_context_with_mutation normalized_context
base_192 context_with_mutation context

The function will also calculate the the proportion of mutations for eachsubtype, normalized to the total_depth:
Ps=(MsDs)s(MsDs)P_s = \frac{\left(\frac{M_s}{D_s}\right)}{\sum_s \left(\frac{M_s}{D_s}\right)}
Where PsP_s is the normalized mutation proportion for subtype ss. MsM_s is the group mutation sum for subtype ss. DsD_s is the group sum of the subtype_depth for subtype ss.
If total_depth is not available for the mutation data, calculate_mf() will return the subtype mutation counts per group. It will also calculate subtype proportions, without normalizing to the total_depth:
Ps=MsMtotalP'_s = \frac{M_s}{M_{total}}
Where, PsP'_s is the non-normalized mutation proportion of subtype ss. MsM_s is the group mutation sum for subtype ss. MtotalM_{total} is the total mutation sum for the group.

Example 3.4. The following code will return the base_6 mutation spectra for all samples with mutation proportions normalized to depth.

mf_data_6 <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "base_6"
)

Table 9. Frequency and Proportions of 6-base Mutation Subtypes per Sample.

Selecting Variation Types

calculate_mf() can be used on a user-defined subset of variation_type values. The variant_types parameter can be set to a character string of variation_type values that the user wants to include in the mutation counts. When calculating group mutation sums, only variants of the specified variation_types will be counted. The total_depth for records of excluded variation_types will still be included in the group_depth and the subtype_depth, if applicable.

By default the function will calculate summary values based on all mutation types.

Example 3.5. Calculate global mutation frequencies per sample including only insertion and deletion mutations in the count.

mf_data_global_indels <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "none",
  variant_types = c("insertion", "deletion")
)

Table 10. Indel Mutation Frequency per Sample.

Users may also supply a list of variation_types to exclude to the variant_types parameter, so long as the value is preceeded by a “-”.

Example 3.6. Calculate mutation frequencies at the “type” subtype resolution, excluding ambiguous and uncategorized mutations.

mf_data_types <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "type",
  variant_types = c("-ambiguous", "-uncategorized")
)

Table 11. Frequency and Proportions of Variation Types per Sample.

Example 3.7. Include only snv mutations at the base_96 resolution

mf_data_96 <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "base_96",
  variant_types = "snv"
)

Table 12. Frequency and Proportions of 96-base Trinucleotide Subtypes per Sample.

Correcting Depth

There may be cases when the same genomic position is represented multiple times within the mutation data of a single sample. For instance, we require that multiple different alternate alleles for a single position be reported as seperate rows. Another common example of this that we have observed are deletion calls. Often, for each deletion called, a no_variant call at the same start position will also be present. For data that includes the total_depth of each record, we want to prevent double-counting the depth at these positions during when summing the depth across groups. When set to TRUE, correct_depth parameter will correct the depth at these positions. For positions with the same sample, contig, and start values, the total_depth will be retained for only one row. All other rows in the group will have their total_depth set to 0 before beign summed. The import functions will automatically check for duplicated rows and return a warning advising to correct the depth should it find any in the mutation data. This is recommended for all users whose data contain total_depth values. By default, correct_depth is set to TRUE.

NOTE: The case of deletions and no_variants: When identifying a deletion, some variant callers will re-align the sequences. As such a second total_depth value will be calculated, specifically for the deletion. The variant caller will then provide both the no_variant and the deletion call, the former with the initial depth, and the latter with the re-aligned depth value. Generally, when correcting the depth, the function will retain the total_depth of the first row in the group, and set the rest to 0. However, in the case of deletions, the function may prioritize retaining the re-aligned total_depth over the total_depth assigned to the no_variant. This prioritization can be activated with the correct_depth_by_indel_priority parameter.

Precalculated Depth

For mutation data that does not include a total_depth value for each sequenced site, precalculated depths for the specified groups can be supplied separately to calculate_mf() through the precalc_depth_data parameter. Depth should be provided at the correct subtype resolution for each level of the specified grouping variable. This parameter will accept either a data frame or a filepath to import the depth data. Required columns are:

  1. The user-defined cols_to_group variable(s)
  2. group_depth: the total_depth summed across the cols_to_group.
  3. The context column for the specified subtype_resolution. Only applicable if using the SNV resolutions (base_6, base_12, base_96, base_192). Column names are listed in the table above.
  4. subtype_depth: the total_depth summed across the cols_to_group for each context. Only applicable if using the SNV resolutions.

Example 3.8. Use precalculated depth values to calculate the global per sample MF.

Table 13. ‘sample_depth’ - Precalculated Depth Data per sample.

sample_depth <- data.frame(
  sample = unique(example_data$sample),
  group_depth = c(565395266, 755574283, 639909215, 675090988, 598104021,
                  611295330, 648531765, 713240735, 669734626, 684951248,
                  716913381, 692323218, 297661400, 172863681, 672259724,
                  740901132, 558051386, 733727643, 703349287, 884821671,
                  743311822, 799605045, 677693752, 701163532)
)

DT::datatable(sample_depth)
mf_data_global_precalc <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "none",
  calculate_depth = FALSE,
  precalc_depth_data = sample_depth
)

Table 14. Mutation Frequency per Sample, Precalculated Depth Example.

Example 3.9. Use precalculated depth values to calculate the base_6 per sample MF.

mf_data_6_precalc <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "base_6",
  calculate_depth = FALSE,
  precalc_depth = system.file("extdata", "Example_files",
                              "precalc_depth_base_6_example.txt",
                              package = "MutSeqR")
)

Table 16. Precalculated Depth Data per sample for base_6 subtype resolution.

Table 17. Frequency and Proportions of 6-base Subtypes per Sample, Precalculated Depth Example.

Summary Table

The function will output the resulting mf_data as a data frame with the MF and proportion calculated. If the summary parameter is set to TRUE, the data frame will be a summary table with the MF calculated for each group. If summary is set to FALSE, the MF will be appended to each row of the original mutation_data.

The summary table will include:

  • cols_to_group: all columns used to group the data.
  • Subtype column for the given resolution, if applicable.
  • Context column for the given resolution, if applicable.
  • sum_min & sum_max: the min/max mutation counts for the group(s)/subtypes.
  • group_depth: the total_depth summed across the group(s).
  • subtype_depth: the total_depth summed across the group(s), for the given context, if applicable.
  • MF_min & MF_max: the min/max MF for the group(s)/subtype.
  • proportion_min & proportion_max: the min/max subtype proportion, if applicable.

Additional columns from the orginal mutation data can be retained using the retain_metadata_cols parameter. Retaining higher-order experimental groups may be useful for later statistical analyses or plotting. See above for example calculating mean MF per dose.

NOTE The summary table uses a pre-defined list of possible subtypes for each resolution. If a particular subtype within a given group is not recorded in the mutation data, the summary table will have no frame of reference for populating the metadata_cols. Thus, for subtypes that do not occur in the mutation data for a given group, the corresponding metadata_col will be NA.

MF Plots

You can visualize the results of calculate_mf using the plot_mf() and plot_mean_mf() functions. These functions offer variable aesthetics for easy visualization. The output is a ggplot object that can be modified using ggplot2.

plot_mf

Example 3.10. Plot the Min and Max MF per sample, coloured and ordered by dose group. See example 3.1 for calculating mf_data_global

# Define the order for dose groups
mf_data_global$dose_group <- factor(
  mf_data_global$dose_group,
  levels = c("Control", "Low", "Medium", "High")
)
plot <- plot_mf(
  mf_data = mf_data_global,
  group_col = "sample",
  plot_type = "bar",
  mf_type = "both",
  fill_col = "dose_group",
  group_order = "arranged",
  group_order_input = "dose_group"
)
plot
Mutation Frequency (MF) Minimum and Maximum (mutations/bp) per Sample. Light colored bars represent MFmin and dark coloured bars represent MFmax. Bars are coloured and grouped by dose. Data labels are the number of mutations per sample.

Mutation Frequency (MF) Minimum and Maximum (mutations/bp) per Sample. Light colored bars represent MFmin and dark coloured bars represent MFmax. Bars are coloured and grouped by dose. Data labels are the number of mutations per sample.

plot_mean_mf()

Calculate and plot the mean MF for a user-defined group using plot_mean_mf().

Example 3.11. Plot the mean MF min per dose, including SEM and individual values coloured by dose. See example 3.1 for calculating mf_data_global

plot_mean <- plot_mean_mf(
  mf_data = mf_data_global,
  group_col = "dose_group",
  mf_type = "min",
  fill_col = "dose_group",
  add_labels = "none",
  group_order = "arranged",
  group_order_input = "dose_group",
  plot_legend = FALSE,
  x_lab = "Dose Group"
)
plot_mean
Mean Mutation Frequency (MF) Minimum per Dose. Lines are mean ± S.E.M. Points are individual samples, coloured by dose.

Mean Mutation Frequency (MF) Minimum per Dose. Lines are mean ± S.E.M. Points are individual samples, coloured by dose.

Modeling MF

Generalized Linear Modeling

An important component of analysing mutagencity data is how MF changes based on experimental variables.

The model_mf() function can be used to quantitatively evaluate how MF changes based on experimental variables. The function models the proportion of reads that carry a mutation (i.e. the MF) across user-supplied fixed and/or random effects and interaction parameters. Depending on the supplied effects, model_mf() will automatically choose to fit either a generalized linear model (GLM) using the glm() function from the stats library or a generalized linear mixed model (GLMM) using the glmer() function from the lme4 library.

Model Formula

  • Fixed effects: The experimental variable(s)/factor(s) of interest.
  • Random effects: Variable used to account for variability within a larger group. Covariate.
  • Interaction: The product of two or more interacting fixed_effects. An interaction implies that the effect of one fixed effect changes depending on the levels of another fixed effect, indicating a non-additive reationship.
  • Response: the function will model the MF, taking the mutation count (sum) and the group_depth as input.

The model formula is built as:

cbind(sum, group_depth) ~ fixed_effect1 + fixed_effect2 + ... + (1|random_effect)

If test_interaction is set to TRUE, the model will use the product of the fixed_effects:

cbind(sum, group_depth) ~ fixed_effect1 * fixed_effect2 * ... + (1|random_effect)

Family

The occurence of mutations is assumed to follow a binomial distribution as:

  1. There is a finite number of sequenced bases.
  2. A mutation at any given base is equally probable.
  3. Mutations occur independently of other mutations.

To account for over-dispersion, the function will fit a GLM with a quasibinomial distribution. If the dispersion parameter of the model is less than 1, the model will be refit using a binomial distribution. The GLMM will always use a binomial distribution.

The model_mf function will fit a generalized linear model to analyse the effect(s) of given fixed_effects(s) on MF and perform specified pairwise comparisons between levels of your factors.

Additional arguments can be passed to the model to further customize it to your needs. Details on the arguments for the generalized linear model can be found at stats::glm and for the general linear mixed model at lme4::glmer.

Estimates & Comparisons

model_mf() provides estimates of the mean for each level of the fixed effects. Furthermore, pairwise comparisons can be performed based on a user-supplied table. Mean estimates and comparisons are conducted using the doBy R library (Halekoh and Højsgaard 2025). Estimates of the back-transformed standard errors are approximated using the delta method. The p-values are adjusted for multiple comparisons using the Sidak method.

Goodness of Fit

The model_mf function will output the Pearson residuals appended to the mf_data. The row with the highest residual will be printed to the console so that users may assess it as an outlier. Additionally, Pearson residuals will be plotted as a histogram and a QQ-plot to check for deviances from model assumptions. We assume that residuals will follow a normal distribution with a mean of 0. Thus, we expect the histogram to follow a bell curve and the QQ-plot to be plotted along the y=x line.

General Usage

Mutation data should first be summarised by sample using calculate_mf(). The mf_data should be output as a summary table. Be sure to retain the columns for experimental variables of interest using the retain_metadata_cols parameter.

You may specify factors and covariates for your model using the fixed_effects and random_effects parameters respectively. If more than one fixed_effect is supplied, then you may specify whether you wish to test the interaction between your fixed_effects using the test_interaction parameter.

You must specify the columns in your mf_data that contain the mutation counts and the total_depth per sample using the muts and total_counts parameters respectively.

To perform pairwise comparisons between levels of your fixed effects, supply a constrast table using the contrasts parameter. This can either be a data frame or a filepath to a text file that will be loaded into R. The table must consist of two columns, each containing levels within the fixed_effects. The level in the first column will be compared to the level in the second column for each row. You should also provide the reference level for each fixed effect using the reference_level parameter. If you specify multiple pairwise comparisons, then the p-values will automatically be corrected for multiple comparisons using the Sidak method. For multiple fixed effects, the user must include levels for all fixed_effects in each value of the contrasts table. Within each value, the levels of the different fixed_effects should be seperated by a colon.

Output

The function will output a list of results.

  • model: the glmm or glmer model object.
  • model_formula: the formula given to glm or glmer.
  • model_data: the supplied mf_data with added column for Pearson residuals.
  • summary: the summary of the model.
  • anova: the analysis of variance for models with two or more fixed_effects. See car::Anova..
  • residuals_histogram: the Pearson residuals plotted as a histogram. This is used to check whether the variance is normally distributed. A symmetric bell-shaped histogram, evenly distributed around zero indicates that the normality assumption is likely to be true.
  • residuals_qq_plot: the Pearson residuals plotted in a quantile-quantile plot. For a normal distribution, we expect points to roughly follow the y=x line.
  • point_estimates_matrix: the contrast matrix used to generate point-estimates for the fixed effects.
  • point_estimates: the point estimates for the fixed effects.
  • pairwise_comparisons_matrix: the contrast matrix used to conduct the pairwise comparisons specified in the contrasts.
  • pairwise_comparisons: the results of pairwise comparisons specified in the contrasts.

Examples

Example 4.1 Model by Dose

Example 4.1. Model the effect of dose on MF. Our example data consists of 24 mouse samples, exposed to 3 doses of BaP or a vehicle control. Dose Groups are : Control, Low, Medium, and High. We will determine if the MFmin of each BaP dose group is significantly increased from that of the Control group. See example 3.1 for calculating mf_data_global

# Create a contrasts table for pairwise comparisons
contrasts_table <- data.frame(
  col1 = c("Low", "Medium", "High"),
  col2 = c("Control", "Control", "Control")
)
# Run the model
model_by_dose <- model_mf(
  mf_data = mf_data_global,
  fixed_effects = "dose_group",
  muts = "sum_min",
  total_count = "group_depth",
  reference_level = "Control",
  contrasts = contrasts_table
)
##        sample dose_group dose sum_min sum_max       mf_min       mf_max
## 14 dna00986.1     Medium   25     160     197 9.255848e-07 1.139626e-06
##    group_depth residuals
## 14   172863681  4.739692
Residuals Histogram
GLM residuals of MFmin modelled as an effect of Dose. x is pearson's residuals, y is frequency. Plotted to validate model assumptions. n = 24.

GLM residuals of MFmin modelled as an effect of Dose. x is pearson’s residuals, y is frequency. Plotted to validate model assumptions. n = 24.

Residuals QQ-plot
GLM residuals of MFmin modelled as an effect of Dose expressed as a quantile-quantile plot. Y is the pearson's residuals of the model in ascending order x is the quantiles of standard normal distribution for n of 24. Plotted to validate model assumptions.

GLM residuals of MFmin modelled as an effect of Dose expressed as a quantile-quantile plot. Y is the pearson’s residuals of the model in ascending order x is the quantiles of standard normal distribution for n of 24. Plotted to validate model assumptions.

Model Summary
model_by_dose$summary
## 
## Call:
## stats::glm(formula = model_formula, family = "quasibinomial", 
##     data = mf_data)
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      -15.56285    0.08290 -187.736  < 2e-16 ***
## dose_groupLow      0.67512    0.10066    6.707 1.58e-06 ***
## dose_groupMedium   1.29746    0.09563   13.567 1.51e-11 ***
## dose_groupHigh     1.69485    0.08915   19.011 2.84e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 4.604243)
## 
##     Null deviance: 2919.634  on 23  degrees of freedom
## Residual deviance:   89.569  on 20  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
Model Estimates

Table 18. model_by_dose$point_estimates: Estimated Mean Mutation Frequency per Dose.

Pairwise Comparisons

Table 19. model_by_dose$pairwise_comparisons

Example 4.2 Model by Dose and Target

Example 4.2.1. Model the effects of dose and genomic locus on MF. Sequencing for the example data was done on a panel of 20 genomic targets. We will determine if the MF of each BaP dose group is significantly different from the Control individually for all 20 targets. In this model, dose group and target label will be our fixed effects. We include the interaction between the two fixed effects. Because sample will be a repeated measure, we will use it as a random effect. See example 3.2 for calculating mf_data_rg

# Create a contrasts table for the pairwise comparisons.
combinations <- expand.grid(dose_group = unique(mf_data_rg$dose_group),
                            label = unique(mf_data_rg$label))
combinations <- combinations[combinations$dose_group != "Control", ]
combinations$col1 <- with(combinations, paste(dose_group, label, sep = ":"))
combinations$col2 <- with(combinations, paste("Control", label, sep = ":"))
contrasts2 <- combinations[, c("col1", "col2")]

# Run the model
# To improve convergence of the model, we will supply the control argument
# to the  glmer function
model_by_target <- model_mf(mf_data = mf_data_rg,
  fixed_effects = c("dose_group", "label"),
  test_interaction = TRUE,
  random_effects = "sample",
  muts = "sum_min",
  total_count = "group_depth",
  contrasts = contrasts2,
  reference_level = c("Control", "chr1"),
  control = lme4::glmerControl(optimizer = "bobyqa",
                               optCtrl = list(maxfun = 2e5))
)
##        sample label dose_group sum_min sum_max       mf_min       mf_max
## 33 dna00974.1  chr2    Control      23      36 5.982805e-07 9.364391e-07
##    group_depth residuals
## 33    38443503  5.481194
Residuals Histogram
GLMM residuals of MFmin modelled as an effect of Dose and Genomic Target. x is pearson's residuals, y is frequency. Plotted to validate model assumptions. n = 24.

GLMM residuals of MFmin modelled as an effect of Dose and Genomic Target. x is pearson’s residuals, y is frequency. Plotted to validate model assumptions. n = 24.

Residuals QQ-plot
GLMM residuals of MFmin modelled as an effect of Dose and Genomic Target expressed as a quantile-quantile plot. Y is the pearson's residuals of the model in ascending order x is the quantiles of standard normal distribution for n of 24. Plotted to validate model assumptions.

GLMM residuals of MFmin modelled as an effect of Dose and Genomic Target expressed as a quantile-quantile plot. Y is the pearson’s residuals of the model in ascending order x is the quantiles of standard normal distribution for n of 24. Plotted to validate model assumptions.

Model Summary
model_by_target$summary
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(sum_min, group_depth) ~ dose_group * label + (1 | sample)
##    Data: mf_data
## Control: ..1
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##    2825.5    3163.6   -1331.7    2663.5       399 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7513 -0.7140 -0.0326  0.6138  5.4812 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  sample (Intercept) 0.01088  0.1043  
## Number of obs: 480, groups:  sample, 24
## 
## Fixed effects:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.596e+01  2.222e-01 -71.807  < 2e-16 ***
## dose_groupHigh                1.903e+00  2.395e-01   7.947 1.92e-15 ***
## dose_groupLow                 6.208e-01  2.738e-01   2.267 0.023380 *  
## dose_groupMedium              1.293e+00  2.612e-01   4.951 7.39e-07 ***
## labelchr1.2                   1.382e-01  2.824e-01   0.489 0.624624    
## labelchr10                    4.517e-01  2.682e-01   1.684 0.092109 .  
## labelchr11                    8.366e-01  2.641e-01   3.168 0.001535 ** 
## labelchr12                    5.183e-01  2.660e-01   1.948 0.051365 .  
## labelchr13                    1.984e-02  2.908e-01   0.068 0.945593    
## labelchr14                    6.560e-01  2.671e-01   2.456 0.014043 *  
## labelchr15                    5.084e-01  2.660e-01   1.911 0.056012 .  
## labelchr16                    7.757e-01  2.693e-01   2.881 0.003970 ** 
## labelchr17                    4.593e-01  2.885e-01   1.592 0.111307    
## labelchr18                   -7.519e-03  2.958e-01  -0.025 0.979721    
## labelchr19                   -2.378e-01  3.084e-01  -0.771 0.440585    
## labelchr2                     6.258e-01  2.660e-01   2.352 0.018661 *  
## labelchr3                     2.568e-01  2.863e-01   0.897 0.369732    
## labelchr4                     5.106e-01  2.773e-01   1.841 0.065603 .  
## labelchr5                     4.209e-01  2.824e-01   1.490 0.136186    
## labelchr6                     2.937e-01  2.806e-01   1.046 0.295350    
## labelchr7                    -3.983e-04  2.958e-01  -0.001 0.998926    
## labelchr8                     4.383e-01  2.908e-01   1.507 0.131704    
## labelchr9                     7.000e-01  2.671e-01   2.621 0.008766 ** 
## dose_groupHigh:labelchr1.2   -2.483e-01  3.022e-01  -0.822 0.411347    
## dose_groupLow:labelchr1.2    -2.540e-03  3.459e-01  -0.007 0.994141    
## dose_groupMedium:labelchr1.2 -2.494e-02  3.289e-01  -0.076 0.939548    
## dose_groupHigh:labelchr10    -4.750e-01  2.882e-01  -1.648 0.099285 .  
## dose_groupLow:labelchr10     -3.356e-01  3.360e-01  -0.999 0.317857    
## dose_groupMedium:labelchr10  -3.318e-01  3.167e-01  -1.048 0.294861    
## dose_groupHigh:labelchr11    -5.554e-02  2.809e-01  -0.198 0.843268    
## dose_groupLow:labelchr11      4.252e-01  3.174e-01   1.340 0.180390    
## dose_groupMedium:labelchr11   4.359e-01  3.029e-01   1.439 0.150068    
## dose_groupHigh:labelchr12    -1.676e-01  2.836e-01  -0.591 0.554373    
## dose_groupLow:labelchr12     -1.544e-01  3.287e-01  -0.470 0.638477    
## dose_groupMedium:labelchr12  -1.756e-01  3.115e-01  -0.564 0.572958    
## dose_groupHigh:labelchr13    -3.978e-01  3.125e-01  -1.273 0.202970    
## dose_groupLow:labelchr13      2.170e-03  3.561e-01   0.006 0.995138    
## dose_groupMedium:labelchr13  -2.386e-01  3.425e-01  -0.697 0.486037    
## dose_groupHigh:labelchr14     1.207e-01  2.832e-01   0.426 0.670094    
## dose_groupLow:labelchr14      2.822e-01  3.228e-01   0.874 0.381936    
## dose_groupMedium:labelchr14   4.877e-01  3.055e-01   1.596 0.110378    
## dose_groupHigh:labelchr15    -1.145e+00  2.935e-01  -3.903 9.51e-05 ***
## dose_groupLow:labelchr15     -4.122e-01  3.348e-01  -1.231 0.218290    
## dose_groupMedium:labelchr15  -6.712e-01  3.214e-01  -2.088 0.036792 *  
## dose_groupHigh:labelchr16    -2.607e-01  2.878e-01  -0.906 0.364977    
## dose_groupLow:labelchr16      1.133e-01  3.279e-01   0.346 0.729681    
## dose_groupMedium:labelchr16   6.542e-02  3.125e-01   0.209 0.834154    
## dose_groupHigh:labelchr17     1.326e-01  3.059e-01   0.434 0.664586    
## dose_groupLow:labelchr17      3.705e-01  3.457e-01   1.072 0.283845    
## dose_groupMedium:labelchr17   3.523e-01  3.308e-01   1.065 0.286780    
## dose_groupHigh:labelchr18     3.822e-01  3.118e-01   1.226 0.220225    
## dose_groupLow:labelchr18      4.491e-01  3.525e-01   1.274 0.202678    
## dose_groupMedium:labelchr18   5.986e-01  3.347e-01   1.789 0.073665 .  
## dose_groupHigh:labelchr19     6.478e-02  3.273e-01   0.198 0.843111    
## dose_groupLow:labelchr19      3.157e-01  3.689e-01   0.856 0.392152    
## dose_groupMedium:labelchr19   3.342e-01  3.522e-01   0.949 0.342680    
## dose_groupHigh:labelchr2     -6.041e-01  2.869e-01  -2.106 0.035210 *  
## dose_groupLow:labelchr2      -2.459e-01  3.312e-01  -0.742 0.457836    
## dose_groupMedium:labelchr2   -2.993e-01  3.138e-01  -0.954 0.340146    
## dose_groupHigh:labelchr3     -1.072e+00  3.178e-01  -3.375 0.000739 ***
## dose_groupLow:labelchr3      -5.817e-01  3.701e-01  -1.572 0.116009    
## dose_groupMedium:labelchr3   -6.911e-01  3.503e-01  -1.973 0.048510 *  
## dose_groupHigh:labelchr4     -2.788e-01  2.967e-01  -0.940 0.347321    
## dose_groupLow:labelchr4      -1.492e-01  3.434e-01  -0.435 0.663885    
## dose_groupMedium:labelchr4   -5.001e-02  3.234e-01  -0.155 0.877121    
## dose_groupHigh:labelchr5     -2.603e-01  3.023e-01  -0.861 0.389180    
## dose_groupLow:labelchr5      -4.738e-02  3.471e-01  -0.136 0.891432    
## dose_groupMedium:labelchr5   -1.236e-01  3.312e-01  -0.373 0.709068    
## dose_groupHigh:labelchr6     -2.353e-01  2.999e-01  -0.785 0.432557    
## dose_groupLow:labelchr6      -4.394e-02  3.448e-01  -0.127 0.898606    
## dose_groupMedium:labelchr6   -2.395e-01  3.307e-01  -0.724 0.468894    
## dose_groupHigh:labelchr7      1.376e-01  3.131e-01   0.439 0.660386    
## dose_groupLow:labelchr7       4.021e-01  3.530e-01   1.139 0.254618    
## dose_groupMedium:labelchr7    5.092e-01  3.355e-01   1.518 0.129058    
## dose_groupHigh:labelchr8      2.822e-01  3.074e-01   0.918 0.358623    
## dose_groupLow:labelchr8       5.457e-01  3.454e-01   1.580 0.114098    
## dose_groupMedium:labelchr8    6.388e-01  3.299e-01   1.936 0.052824 .  
## dose_groupHigh:labelchr9     -4.547e-01  2.867e-01  -1.586 0.112740    
## dose_groupLow:labelchr9      -1.701e-01  3.304e-01  -0.515 0.606713    
## dose_groupMedium:labelchr9   -4.265e-01  3.172e-01  -1.345 0.178701    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA
model_by_target$anova
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: cbind(sum_min, group_depth)
##                    Chisq Df Pr(>Chisq)    
## dose_group        600.61  3  < 2.2e-16 ***
## label            1239.68 19  < 2.2e-16 ***
## dose_group:label  129.41 57  1.493e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model Estimates

Table 20. model_by_target$point_estimates: Estimated Mean Mutation Frequency per Dose and Genomic Target.

Pairwise Comparisons

Table 21. model_by_target$pairwise_comparisons.

Plot Model Results

Plot the results of model_mf() using plot_model_mf(). This function will create a bar or line plot of the point estimates. Users can include the estimated standard error as error bars with plot_error_bars and add significance labels based on the pairwise comparisons with plot_signif. The function can plot model results with up to two fixed effects. Users must specify which fixed effect should be represented on the x-axis with x_effect. The other fixed effect will be represented by the fill aesthetic.

When adding significance labels, the function will generate a unique symbol for each unique level or combination of levels in column 2 of the contrasts table. If using two fixed effects, but your comparisons only contrast levels of one fixed effect, you may specify this using the ref_effect parameter. Supply it with the fixed effect that is being contrasted to reduce the number of unique symbols generated.

The output is a ggplot object that can be modified with ggplot2.

Plot Model by Dose
plot <- plot_model_mf(
  model_by_dose,
  plot_type = "bar",
  x_effect = "dose",
  plot_error_bars = TRUE,
  plot_signif = TRUE,
  x_order = c("Control", "Low", "Medium", "High"),
  x_label = "Dose Group",
  y_label = "Estimated Mean Mutation Frequency (mutations/bp)"
)
plot
Mean Mutation Frequency Minimum (mutations/bp) per Dose estimated using a generalized linear model. Error bars are the S.E.M. Symbols indicate significance differences (p < 0.05).

Mean Mutation Frequency Minimum (mutations/bp) per Dose estimated using a generalized linear model. Error bars are the S.E.M. Symbols indicate significance differences (p < 0.05).

Plot Model by Dose and Target

In this example, we only made comparisons between dose groups. For each contrast, we held the label (target) constant. Thus, we will set the ref_effect to dose_group so that significance labels are generated to indicate differences in dose, and not label.

# Define the order of the genomic targets for the x-axis:
# We will order them from lowest to highest MF at the High dose.
label_order <- model_by_target$point_estimates %>%
  dplyr::filter(dose_group == "High") %>%
  dplyr::arrange(Estimate) %>%
  dplyr::pull(label)

# Define the order of the doses for the fill
dose_order <- c("Control", "Low", "Medium", "High")

plot <- plot_model_mf(
  model = model_by_target,
  plot_type = "bar",
  x_effect = "label",
  plot_error_bars = TRUE,
  plot_signif = TRUE,
  ref_effect = "dose_group",
  x_order = label_order,
  fill_order = dose_order,
  x_label = "Target",
  y_label = "Mutation Frequency (mutations/bp)",
  fill_label = "Dose",
  plot_title = "",
  custom_palette = c("#ef476f",
                     "#ffd166",
                     "#06d6a0",
                     "#118ab2")
)
# Rotate the x-axis labels for clarity using ggplot2 functions.
plot <- plot + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90))
plot
Mean Mutation Frequency Minimum (mutations/bp) per Genomic Target and Dose estimated using a generalized linear mixed model. Error bars are the SEM. Symbols indicate significance differences (p < 0.05) between dose levels for individual genomic regions.

Mean Mutation Frequency Minimum (mutations/bp) per Genomic Target and Dose estimated using a generalized linear mixed model. Error bars are the SEM. Symbols indicate significance differences (p < 0.05) between dose levels for individual genomic regions.

Benchmark Dose Modeling

Dose-response models are essential for quatitative risk assessment of mutagenicity, as they provide a framework to evaluate the levels at which exposure to a substance might cause an adverse effect. The benchmark dose (BMD) is a dose that produces a predetermined change in the measured response, defined as the benchmark response (BMR). The BMD is used as a point of departure to derive human health-based guidance values to inform regulatory risk assessment such as the reference dose (RfD), the derived no-effect level (DNEL) or the acceptable daily intake (ADI).

The BMD is estimated by applying various mathmatical models to fit the dose-response data. Some requirements must be met before modelling the BMD. There must be a clear dose-response trend in the MF data. We suggest using model_mf() to test for significant increases in MF with dose prior to running a BMD analysis. In general, studies with more dose groups and a graded monotonic response with dose will be more useful for BMD analysis. A minimum of three dose groups + 1 control group is suggested. Datasets in which a response is only observed at the high dose are usually not suitable for BMD modeling. However, if the one elevated response is near the BMR, adequate BMD computation may result. For a better estimate of the BMD, it is preferable to have studies with one or more doses near the level of the BMR.

Protection and safety authorities recommend the use of model averaging to determine the BMD and its confidence intervals. Model averaging incorporates information across multiple models to acount for model uncertainty, allowing the BMD to be more accurately estimated.

Ideally, the BMR would be based on a consensus scientific definition of what minimal level of change in MF is biologically significant. Currently, the default provided by this package calculates the BMD at a 50% relative increase in MF from the background. This BMR was selected based on previous recommendations for genotoxicity assessment by White, Long, and Johnson (2020).

MutSeqR provides two functions for BMD modelling, each employing widely-used software designed to be consistent with methods used by regulatory authorities.

  1. bmd_proast() runs a modified version of the proast71.1 R library that is parametirized instead of menu-based.
  2. bmd_toxicr uses the ToxicR library, available on Github.

PROAST

bmd_proast will analyze the continuous, individual MF data following a log transformation. PROAST uses four families of nested models: exponential, Hill, inverse exponential, and log-normal. The Akaike information criterion (AIC) is used to select best fit. BMD confidence intervals are assessed by the Maximum Likelihood Profile method, or by model averaging via bootstrapping (recommended). BMR values are a user-defined relative increase in MF from the control. Alternatively, users may set the BMR as one standard-deviation from the control.

General Usage: bmd_proast

Supply bmd_proast with per-sample mf data calculated using calculate_mf(), with the dose column retained in the summary table. Dose column must be numeric.

Specify the column that contains the numeric dose values in dose_col. The function can model more than one response variable at once. Supply all response variables to response_col. If you wish to include a covariate in the analysis, supply the covariate variable to covariate_col. PROAST will assess if the BMD values differ significantly between levels of the covariate and give a BMD estimate for each.

It is highly recommended to use model averaging when calculating the BMD confidence intervals. Specify the number of bootstraps to run with num_bootstraps. The recommended value is 200, but be aware that this may take some time to run. PROAST model averaging will return the upper and lower 90% BMD confidence intervals. MutSeqR calculates the model-averaged BMD value as the median BMD of all bootstrap runs.

Users may choose to generate model plots with plot_results. Plots may be automatically saved to an output directory specified in output_path. Alternatively, when output_path is NULL, plots will automcatically be displayed and then returned within a list alongside the BMD summary results.

The function will output a data frame of final results, including a BMD estimate for each model family and the model averaging results, if applicable. Users may access the raw, unparsed PROAST results by setting raw_results = TRUE.

Example

Example 5.1. Calculate the BMD with model averaging for a 50% relative increase in MF from control. This will be calculated for both MFmin and MFmax. * See example 3.1 for calculating mf_data_global.*

proast_results <- bmd_proast(
  mf_data = mf_data_global,
  dose_col = "dose",
  response_col = c("mf_min", "mf_max"),
  bmr = 0.5,
  model_averaging = TRUE,
  num_bootstraps = 10, # recommended value 200
  plot_results = FALSE
)

Table 22. BMD Estimates by PROAST.

ToxicR

bmd_toxicr will analyze continuous, individual or continuous, summary MF data, assuming either a normal (default) or log-normal distribution. This function employs Bayesian estimation using either the Laplace Maximum a posteriori approach (default) (Gelman et al. 1995) or Markov chain Monte Carlo (MCMC) simulation (Brooks et al. 2011). The default model parameter’s prior distributions are specified in (Wheeler et al. 2020), but they are user-modifiable. The default models are described in the European Food Safety Authority’s 2022 Guidance on the use of benchmark dose approach in risk assessment (Committee et al. 2022). Model averaging may be applied using described methodologies (Wheeler et al. (2020); Wheeler et al. (2022)).

ToxicR offers several options for the BMR:

  • Relative deviation (rel): the BMD represents the dose that changes the mean MF a certain percentage from the background dose.
  • Standard deviation (sd): the BMD represents the dose associated with the mean MF changing a specified number of standard deviations from the background mean.
  • Absolute deviation (abs): the BMD represents the dose associated with a specified absolute deviation from the background mean.
  • Hybrid deviation (hybrid): the BMD represents the dose that changes the probability of an adverse event by a specified amount.

One of these options can be specified using the bmr_type parameter. The bmr parameter is set to a numeric value specifying the BMR, defined in relation to the calculation requested in bmr_type.

Selecting an appropriate BMR involves making judgements about the statistical and biological characteristics of the dataset and about the applications for which the resuling BMDs will be used.

Install ToxicR
devtools::install_github("NIEHS/ToxicR")

See the ToxicR repository on github for more information on installing the package.

General Usage: bmd_toxicr

Examples include external dependencies

Supply bmd_toxicr with the per-sample mf data from calculate_mf(). Retain the dose column in the summary table. Specify the column that contains the numeric dose values in dose_col. The function can model more than one response variable at once. Supply all response variables to response_col.

Calculate the BMD with model averaging by setting model_averaging = TRUE. The confidence level for the upper and lower confidence intervals can be defined with the alpha parameter (default 90% CI).

Users may choose to generate model plots with plot_results. If TRUE, plots may be automatically saved to an output directory specified in output_path. Alternatively, if output_path is NULL, plots will be returned within a list alongside the the BMD results.

The function will return the BMD with its upper and lower confidence intervals for each response variable. If model averaging, a breakdown of the model averaging process can be returned alongside the results ma_summary = TRUE. This will return the estimate for each model with its associated posterior probability.

Example

Example 5.2. Calculate the BMD with model averaging for a 50% relative increase in MF from control. This will be calculated for both MFmin and MFmax. See example 3.1 for calculation of mf_data_global.

toxicr_results <- bmd_toxicr(
  mf_data = mf_data_global,
  dose_col = "dose",
  response_col = c("mf_min", "mf_max"),
  bmr_type = "rel",
  bmr = 0.5,
  model_averaging = TRUE,
  ma_summary = TRUE,
  plot_results = FALSE
)
toxicr_results

Plot BMD CIs

plot_ci() creates a confidence interval (CI) plot of BMD results for easy comparison of BMDs between response variables.

The BMD values can be plotted on the log scale, if required.

Example 5.3. Compare the estimated BMD for MFmin (50% relative increase) by PROAST versus ToxicR

plot_results <- data.frame(
  Response = c("PROAST", "ToxicR"),
  BMD = c(9.111, 9.641894),
  BMDL = c(7.38, 8.032936),
  BMDU = c(10.9, 10.97636)
)
plot <- plot_ci(
  data = plot_results,
  order = "asc",
  x_lab = "Dose (mg/kg-bw/d)",
  y_lab = "BMD Method"
)
plot
Benchmark dose with 90% confidence intervals representing the dose at, which a 50% increase in mutation frequency occurs from reference level. Calculated using ROAST software. Black points represent the BMD, red points the BMDL, and blue points, the BMDU

Benchmark dose with 90% confidence intervals representing the dose at, which a 50% increase in mutation frequency occurs from reference level. Calculated using ROAST software. Black points represent the BMD, red points the BMDL, and blue points, the BMDU

Mutation Spectra Analysis

The mutation spectra is the pattern of mutation subtypes within a sample or group. The mutation spectra can inform on the mechanisms involved in mutagenesis.

Group Comparison

Compare the mutation spectra between experimental groups using the spectra_comparison() function. This function will compare the proportion of mutation subtypes at any resolution between user-defined groups using a modified contingency table approach (Piegorsch and Bailer 1994).

This approach is applied to the mutation counts of each subtype in a given group. The contingency table is represented as R*TR * T where RR is the number of subtypes, and TT is the number of groups. spectra_comparison() performs comparisons between T=2T = 2 specified groups. The statistical hypothesis of homogeneity is that the proportion (count/group total) of each mutation subtype equals that of the other group. To test the significance of the homogeneity hypothesis, the G2G^{2} likelihood ratio statistic is used:

G2=2i=1Rj=1TYijlog(YijEij)G^{2} = 2\ \sum_{i=1}^{R}\ \sum_{j=1}^{T}\ Y_{ij}\ log(\frac{Y_{ij}}{E_{ij}})

YijY_{ij} represents the mutation counts and EijE_{ij} are the expected counts under the null hypothesis. The G2G^{2} statistic possesses approximately a χ2\chi^{2} distribution in large sample sizes under the null hypothesis of no spectral differences. Thus, as the column totals become large, G2G^{2} may be referred to a χ2\chi^{2} distribution with (R1)(T1)(R - 1)(T - 1) degrees of freedom. The G2G^{2} statistic may exhibit high false positive rates in small sample sizes when referred to a χ2\chi^{2} distribution. In such cases, we instead switch to an F-distribution. This has the effect of reducing the rate at which G2G^{2} rejects each null hypothesis, providing greater stability in terms of false positive error rates. Thus when N/(R1)<20N/(R-1) < 20, where NN is the total mutation counts across both groups, the function will use a F-distribution, otherwise it will use a χ2\chi^{2}-distribution.

This comparison assumes independance among the observations. Each tabled observation represents a sum of independent contributions to the total mutant count. We assume independance is valid for mutants derived from a mixed population, however, mutants that are derived clonally from a single progenitor cell would violate this assumption. As such, it is recommended to use the MFmin method of mutation counting for spectral analyses to ensure that all mutation counts are independant. In those cases where the independence may be invalid, and where additional, extra-multinomial sources of variability are present, more complex, hierarchical statistical models are required. This is currently outside the scope of this package.

General Usage: spectra_comparison()

The first step is to calculate the per-group mf data at the desired subtype_resolution using calculate_mf(). This mf_data is then supplied to spectra_comparison() along with a contrasts table to specify the comparisons. The contrasts table will consist of two columns, each specifying a group to be contrasted against the other (see Generalized Linear Modelling for more details on contrast tables).

Specify the variable by which you wish to make your comparisons with The exp_variable parameter.

The function will output the G2G^{2} statistic and p-value for each specified comparison listed in constrasts. P-values are adjusted for multiple comparison using the Sidak method.

Example 6.1. In our example data, we are studying the mutagenic effect of BaP. Our samples were exposed to three doses of a BaP (Low, Medium, High), or to the vehicle control (Control). We will compare the base_6 snv subtypes, alongside non-snv variants, of each of the three dose groups to the control. In this way we can investigate if exposure to BaP leads to significant spectral differences.

# Calculate the MF per dose at the base_6 resolution.
mf_data_6_dose <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "dose_group",
  subtype_resolution = "base_6"
)

# Create the contrast table
contrasts_table <- data.frame(
  col1 = c("Low", "Medium", "High"),
  col2 = c("Control", "Control", "Control")
)
# Run the analysis
ex_spectra_comp <- spectra_comparison(
  mf_data = mf_data_6_dose,
  exp_variable = "dose_group",
  contrasts = contrasts_table
)

Table 23. Comparison of the 6-base subtype Proportions between Dose Groups.

Mutational Signatures

Mutational processes generate characteristic patterns of mutations, known as mutational signatures. Distinct mutational signatures have been extracted from various cancer types and normal tissues using data from the Catalogue of Somatic Mutations in Cancer, (COSMIC) database. These include signatures of single base substitutions (SBSs), doublet base substitutions (DBSs), small insertions and deletions (IDs) and copy number alterations (CNs). It is possible to assign mutational signatures to individual samples or groups using the signature_fitting function. Linking ECS mutational profiles of specific mutagens to existing mutational signatures provides empirical evidence for the contribution of environmental mutagens to the mutations found in human cancers and informs on mutagenic mechanisms.

The signature_fitting function utilizes the SigProfiler suite of tools (Díaz-Gay et al. (2023); Khandekar et al. (2023)) to assign SBS signatures from the COSMIC database to the 96-base SNV subtypes of a given group by creating a virtual environment to run python using reticulate. signature_fitting() facilitates interoperability between these tools for users less familiar with python and assists users by coercing the mutation data to the necessary structure for the SigProfiler tools.

This function will install several python dependencies using a conda virtual environment on first use, as well as the FASTA files for all chromosomes for your specified reference genome. As a result ~3Gb of storage must be available for the downloads of each genome.

SNVs in their 96-base trinucleotide context are summed across groups to create a mutation count matrix by SigProfilerMatrixGeneratorR() (SigProfilerMatrixGenerator; Khandekar et al. (2023)). Analyze.cosmic_fit (SigProfilerAssignment; Díaz-Gay et al. (2023)) is then run to assign mutational signatures to each group using refitting methods, which quantifies the contribution of a set of signatures to the mutational profile of the group. The process is a numerical optimization approach that finds the combination of mutational signatures that most closely reconstructs the mutation count matrix. To quantify the number of mutations imprinted by each signature, the tool uses a custom implementation of the forward stagewise algorithm and it applies nonnegative least squares, based on the Lawson-Hanson method. Cosine similarity values, and other solution statistics, are generated to compare the reconstructed mutational profile to the original mutational profile of the group, with cosine values > 0.9 indicating a good reconstruction.

Currently, signature_fitting offers fitting of COSMIC version 3.4 SBS signatures to the SBS96 matrix of any sample/group. For advanced use, including using a custom set of reference signatures, or fitting the DBS, ID, or CN signatures, it is suggested to use the SigProfiler python tools directly in as described in their respective documentation.

General Usage: signature_fitting()

Examples include external dependencies

signature_fitting requires the installation of reticulate and SigProfilerMatrixGeneratorR as well as a version of python 3.8 or newer.

# Install reticulate
install.packages("reticulate")

# Install python
reticulate::install_python()

# Install SigProfilerMatrixGeneratorR from github using devtools.
devtools::install_github("AlexandrovLab/SigProfilerMatrixGeneratorR")

signature_fitting will take the imported (and filtered, if applicable) mutation_data and create mutational matrices for the SNV mutations. Mutations are summed across levels of the group parameter. This can be set to individual samples or to an experimental group. Variants flagged in the filter_mut column will not be included in the mutational matrices.

The project_genome will be referenced for the creation of the mutational matrices. The reference genome will be installed if it is not already.

The virtual environment can be specified with the env_name parameter. If no such environmnent exists, then the function will create one in which to store the dependencies and run the signature refitting. Specify your version of python using the python_version parameter (must be 3.8 or higher).

Example 6.2. Determine the COSMIC SBS signatures associated with each BaP dose group.

# Run Analysis
signature_fitting(
  mutation_data = example_data, # filtered mutation data
  project_name = "Example",
  project_genome = "mm10",
  env_name = "MutSeqR",
  group = "dose_group",
  python_version = "3.11",
  output_path = NULL
)

Output

Results from the signature_fitting will be stored in an output folder. A filepath to a specific output directory can be designated using the output_path parameter. If null, the output will be stored within your working directory. Results will be organized into subfolders based on the group parameter. The output structure is divided into three folders: input, output, and logs.

Input folder: “mutations.txt”, a text file with the mutation_data coerced into the required format for SigProfilerMatrixGenerator. It consists of a list of all the snv variants in each group alongside their genomic positions. This data serves as input for matrix generation.

Log folder: the error and the log files for SigProfilerMatrixGeneration.

The output folder contains the results from matrix generation and signature refitting desribed in detail below.

Matrix Generation Output

signature_fitting uses SigProfilerMatrixGenerator to create mutational matrices ((Bergstrom et al. 2019)). Mutation matrices are created for single-base substitutions (SBS) and doublet-base substitutions (DBS), including matrices with extended sequence context and transcriptional strand bias. SBS and DBS matrices are stored in their respective folders in the output directory. Only the SBS96 matrix is used for refitting. Matrices are stored as .all files which can be viewed in a text-editor.

(#tab:mat-files) Output Files from SigProfilerMatrixGeneration by Folder.
Folder File Definition Plot file
SBS project.SBS6.all The 6 pyrimidine single-nucleotide variants. C>A, C>G, C>T, T>A, T>C, or T>G plots/SBS_6_plots_project.pdf
SBS project.SBS18.all The 6 pyrimidine single-nucleotide variants within 3 transcriptional bias categories: Untranscribed (U), Transcribed (T), Non-Transcribed Region (N).
SBS project.SBS24.all The 6 pyrimidine single-nucleotide variants within 4 transcriptional bias categories: Untranscribed (U), Transcribed (T), Bidirectional (B), Non-Transcribed Region (N). plots/SBS_24_plots_project.pdf
SBS project.SBS96.all The 6 pyrimidine single-nucleotide variants alongside their flanking nucleotides (4 x 4 = 16 combinations). Ex. A[C>G]T plots/SBS_96_plots_project.pdf
SBS project.SBS288.all The 96-base single-nucleotide variants within 3 transcriptional bias categories (U, T, N). plots/SBS_288_plots_project.pdf
SBS project.SBS384.all The 96-base single-nucleotide variants within 4 transcriptional bias categories (U, T, N, B). plots/SBS_384_plots_project.pdf
SBS project.SBS1536.all The 6 pyrimidine single-nucleotide variants alongside their flanking dinucleotides (16 x 16 = 256 combinations). Ex. AA[C>G]TT plots/SBS_1536_plots_project.pdf
SBS project.SBS4608.all The 1536-base single-nucleotide variants within 3 transcriptional bias categories (U, T, N).
SBS project.SBS6144.all The 1536-base single-nucleotide variants within 4 transcriptional bias categories (U, T, N, B).
DBS project.DBS78.all The 78 pyrimidine double-nucleotide variants. plots/DBS_78_plots_project.pdf
DBS project.DBS150.all The 36 dinucleotide combinations that have only all purines or all pyrimidines x 3 transcriptionla bias categories (U, T, N).
DBS project.DBS186.all The 36 dinucleotide combinations that have only all purines or all pyrimidines x 4 transcriptional bias categories (U, T, N, B). plots/DBS_186_plots_project.pdf
DBS project.DBS1248.all The 78 pyrimidine double-nucleotide variants alongside their flanking nucleotides (Possible starting nucleotides (4) x 78 x possible ending nucleotides (4) = 1248 combinations)
DBS project.DBS2400.all The 36 dinucleotide combinations that have only all purines or all pyrimidines alongside their flanking nucleotides within transcriptional bias categories (U, T, N).
DBS project.DBS2676.all The 36 dinucleotide combinations that have only all purines or all pyrimidines alongside their flanking nucleotides within 4 transcriptional bias categories (U, T, N, B).
TSB strandBiasTes_24.txt Transcription Strand Bias Test stats of the SBS6 variants
TSB strandBiasTes_384.txt Transcription Strand Bias Test stats of the SBS96 variants
TSB strandBiasTes_6144.txt Transcription Strand Bias Test stats of the SBS1536 variants
TSB significantResults_strandBiasTest.txt Returns significant results from the three files above.

Doublet-base Matrices (DBS): DBS are somatic mutations in which a set of two adjacent DNA base-pairs are simultaneously substituted with another set of two adjacent DNA base-pairs. We do not recommend using the DBS matrices generated using signature_fitting The signature_fitting function is designed to handle only SBS mutations. All true MNVs, including doublets, are filtered out of the mutation_data prior to MatrixGeneration. However, the tool will still attempt to identify DBSs and will occasionally find two independent SBSs occuring next to each other simply by chance. If you wish to use DBS mutations in your signature analysis, please refer directly to the SigProfiler tools.

Plots: Barplots of the mutation matrices for all groups can be found in the “plots” folder. The number of mutations are plotted for each group at the various subtype resolutions. Files for each matrix are listed in Table @ref(tab:mat-files).

vcf_files: This output folder provides text-based files containing the original mutations and their SigProfilerMatrixGenerator classification for each chromosome. The files are separated into dinucleotides (DBS), multinucleotide substitutions (MNS), and single nucleotide variants (SNV) folders containing the appropriate files. The headers are:

  1. The group
  2. the chromosome
  3. the position
  4. the SigProfilerMatrixGenerator classification
  5. the strand {1, 0, -1}.

The headers for each file are the same with the exception of the MNS files which don’t contain a matrix classification or a strand classification. As noted above the DBS and MNS matrices do no reflect the true mutation counts for these variant types. Only SBS/SNV mutations are included in the matrix generation.

Transcription Strand Bias (TSB): SBS and DBS mutations are tested for transcription strand bias. Mutations are first classified within the four transcriptional bias categories:

(#tab:transcript-cat) SigProfiler Transcriptional Bias Categories.
Category Description
Transcribed (T) The variant is on the transcribed (template) strand.
Untranscribed (U) The variant is on the untranscribed (coding) strand.
Bidirectional (B) The variant is on both strands and is transcribed either way.
Nontranscribed (N) The variant is in a non-coding region and is untranslated.

The tool will then perform a transcription strand bias test which compares the number of transcribed and untranscribed mutations for each mutation type. For example, it will compare the number of transcribed T>C to untranscribed T>C mutations. Should there be a significant difference, it would indicate that T:A>C:G mutations are occuring at a higher rate on one of the strands compared to the other. The output files contain the following headers:

  1. the group
  2. the mutation type
  3. the enrichment value (# Transcribed / # untranscribed)
  4. the p-value, corrected for multiple comparisons using the false discrover rate method
  5. the false discovery rate q-value
Signature Refitting Results

Results from the signature refitting perfomed by SigProfilerAssignment will be stored within the “Assignment_Solution” folder. “Assignment_Solution” consists of 3 subdirectories; “Activities”, “Signatures”, and “Solution_Stats”.

Activities

File Description
Assignment_Solution_Activities.txt This file contains the activity matrix for the selected signatures. The first column lists all of the samples/groups. All of the following columns list the calculated activity value for the respective signatures. Signature activities correspond to the specific numbers of mutations from the sample’s original mutation matrix caused by a particular mutational process.
Assignment_Solution_Activity_Plots.pdf This file contains a stacked barplot showing the number of mutations in each signature on the y-axis and the samples/groups on the x-axis.
Assignment_Solution_TMB_plot.pdf This file contains a tumor mutational burden plot. The y-axis is the somatic mutations per megabase and the x-axis is the number of samples/groups plotted over the total number of samples/groups included. The column names are the mutational signatures and the plot is ordered by the median somatic mutations per megabase.
Decomposed_Mutation_Probabilities.txt This file contains the probabilities of each of the 96 mutation types in each sample/group. The probabilities refer to the probability of each mutation type being caused by a specific signature. The first column lists all the samples/groups, the second column lists all the mutation types, and the following columns list the calculated probability value for the respective signatures.
SampleReconstruction This folder contains generated plots for each sample/group summarizing the assignment results. Each plot consists of three panels. (i) Original: a bar plot of the inputted 96SBS mutation matrix for the sample/group. (ii) Reconstructed: a bar plot of the reconstruction of the original mutation matrix. (iii) The mutational profiles for each of the known mutational signatures assigned to that sample/group, including the activities for each signature. Accuracy metrics for the reconstruction are displayed at the bottom of the figure.

Signatures

Files Description
Assignment_Solution_Signatures.txt The distribution of mutation types in the input mutational signatures. The first column lists all 96 of the mutation types. The following columns are the signatures.
SBS_96_plots_Assignment_Solution.pdf Barplots for each signature identified that depicts the proportion of the mutation types for that signature. The top right corner also lists the total number of mutations and the percentage of total mutations assigned to the mutational signature.

Solution_Stats

Files Description
Assignment_Solution_Samples_Stats.txt The accuracy metrics for the reconstruction. statistics for each sample including the total number of mutations, cosine similarity, L1 norm (calculated as the sum of the absolute values of the vector), L1 norm percentage, L2 norm (calculated as the square root of the sum of the squared vector values), and L2 norm percentage, along with the Kullback-Leibler divergence.
Assignment_Solution_Signature_Assignment_log.txt The events that occur when known signatures are assigned to an input sample. The information includes the L2 error and cosine similarity between the reconstructed and original sample within different composition steps.

Other Files

JOB_METADATA_SPA.txt: This file contains the metadata about system and runtime.

Use SigProfiler Webtool

Users may choose to use the SigProfiler Webtool instead of using the signature_fitting()function. MutSeqR offers functions to coerce mutation data into the proper format for input files.

Mutation Calling File

write_mutation_calling_file() creates a simple text file from mutation data that can be used for mutation signatures analysis using the SigProfiler Assignment web application as a “mutation calling file”. Signature analyses are done at the sample level when using mutation calling files. The file will be saved to your output directory, specified in output_path.

Example 6.3. Analyze the COSMIC SBS signatures contributing to each of the 24 samples using the SigProfiler Web Tool. Output a mutation calling file that can be uploaded to the webtool.

write_mutation_calling_file(
  mutation_data = example_data,
  project_name = "Example",
  project_genome = "mm10",
  output_path = NULL
)
Mutational Matrix

write_mutational_matrix() will sum mutations across user-defined groups before coercing the data into the proper format for input as a “mutational matrix”. SNV subtypes can be resolved to either the base_6 or base_96 resolution. The file is saved to the specified output directory.

Example 6.4. Analyze the COSMIC SBS signatures contributing to each dose group using the SigProfiler Web Tool. Output a mutational matrix that can be uploaded to the webtool.

write_mutational_matrix(
  mutation_data = example_data,
  group = "dose_group",
  subtype_resolution = "base_96",
  mf_type = "min",
  output_path = NULL
)

Plot Spectra

plot_spectra

The mutation spectra can be visualized with plot_spectra which will create a stacked bar plot for user-defined groups at the desired subtype resolution. Mutation subtypes are represented by colour. The value can represent subtype count (sum), frequency (mf), or proportion.

Proportion

Example 6.5. Plot the base_6 proportions for each dose group.

# Calculate the mf data at the 6-base resolution for each dose
# We will exclude ambiguous or uncategorized variants since we don't
# have any in this data.
mf_data_6_dose <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "dose_group",
  subtype_resolution = "base_6",
  variant_types = c("-ambiguous", "-uncategorized")
)
# Set the desired order for the dose group:
mf_data_6_dose$dose_group <- factor(
  mf_data_6_dose$dose_group,
  levels = c("Control", "Low", "Medium", "High")
)
# Plot
plot <- plot_spectra(
  mf_data = mf_data_6_dose,
  group_col = "dose_group",
  subtype_resolution = "base_6",
  response = "proportion",
  group_order = "arranged",
  group_order_input = "dose_group",
  x_lab = "Dose Group",
  y_lab = "Subtype Proportion"
)
plot
Mutation spectrum (minimum) per Dose. Subtypes include single-nucleotide, variants at 6-base resolution, complex variants, deletions, insertions, multi-nucleotide variants (mnv) and structural variants (sv). Subtypes are represented by colour. Data is the proportion normalized to sequencing depth.

Mutation spectrum (minimum) per Dose. Subtypes include single-nucleotide, variants at 6-base resolution, complex variants, deletions, insertions, multi-nucleotide variants (mnv) and structural variants (sv). Subtypes are represented by colour. Data is the proportion normalized to sequencing depth.

Frequency

Example 6.6. Plot the base_6 frequences for each dose group.

# Plot
plot <- plot_spectra(
  mf_data = mf_data_6_dose,
  group_col = "dose_group",
  subtype_resolution = "base_6",
  response = "mf",
  group_order = "arranged",
  group_order_input = "dose_group",
  x_lab = "Dose Group",
  y_lab = "Subtype Frequency (mutations/bp)"
)
plot
Mutation spectrum (minimum) per Dose. Subtypes include single-nucleotide, variants at 6-base resolution, complex variants, deletions, insertions, multi-nucleotide variants (mnv) and structural variants (sv). Subtypes are represented by colour. Data is subtype frequency (mutations/bp).

Mutation spectrum (minimum) per Dose. Subtypes include single-nucleotide, variants at 6-base resolution, complex variants, deletions, insertions, multi-nucleotide variants (mnv) and structural variants (sv). Subtypes are represented by colour. Data is subtype frequency (mutations/bp).

Sum

Example 6.7. Plot the base_6 mutation sums for each dose group.

# Plot
plot <- plot_spectra(
  mf_data = mf_data_6_dose,
  group_col = "dose_group",
  subtype_resolution = "base_6",
  response = "sum",
  group_order = "arranged",
  group_order_input = "dose_group",
  x_lab = "Dose Group",
  y_lab = "Subtype Mutation Count"
)
plot

Hierarchical Clustering

plot_spectra() integrates cluster_spectra() which performs unsupervised hierarchical clustering of samples based on the mutation spectra. cluster_spectra() uses dist() from the stats library to compute the sample-to-sample distances using a user-defined distance measure (default Euclidean). The resulting distance matrix is passed to hclust() to cluster samples using the specified linkage method (default Ward). The function will output a dendrogram visually representing the clusters’ relationships and hierarchy. The dendrogram will be overlaid on the plot_spectra() bar plot and samples will be ordered accordingly.

Example 6.8. Plot the base_6 mutation spectra per sample, with hierarchical clustering. For this example we have created a new sample column with more intuitive sample names: new_sample_id. These names correspond to their associated dose groups. We will see that samples largly cluster within their dose groups.*

# Calculate the mf data at the 6-base resolution for each sample
mf_data_6 <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "new_sample_id",
  subtype_resolution = "base_6",
  variant_types = c("-ambiguous", "-uncategorized")
)
# Plot
plot <- plot_spectra(
  mf_data = mf_data_6,
  group_col = "new_sample_id",
  subtype_resolution = "base_6",
  response = "proportion",
  group_order = "clustered",
  x_lab = "Sample",
  y_lab = "Subtype Proportion"
)
plot
Mutation spectrum (minimum) per Sample. Subtypes include single-nucleotide variants at 6-base resolution, complex variants, deletions, insertions, multi-nucleotide variants (mnv) and structural variants (sv). Subtypes are represented by colour. Data is the proportion normalized to sequencing depth. Samples are clustered based on the Euclidean distance between their subtype proportions.

Mutation spectrum (minimum) per Sample. Subtypes include single-nucleotide variants at 6-base resolution, complex variants, deletions, insertions, multi-nucleotide variants (mnv) and structural variants (sv). Subtypes are represented by colour. Data is the proportion normalized to sequencing depth. Samples are clustered based on the Euclidean distance between their subtype proportions.

Trinucleotide Plots

The 96-base SNV mutation subtypes can be vizualised using plot_trinucleotide(). This function creates a bar plot of the 96-base SNV spectrum for all levels of a user-defined group. Data can represent subtype mutation count (sum), frequency (mf), or proportion. Aesthetics are consistent with COSMIC trinucleotide plots. Plots are automatically saved to the specified output directory.

Example 6.9. plot the base_96 mutation spectra proportions for each dose group.

# Calculate the mf data at the 96-base resolution for each dose
mf_data_96_dose <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "dose_group",
  subtype_resolution = "base_96",
  variant_types = "snv"
)
# Plot
plots <- plot_trinucleotide(
  mf_96 = mf_data_96_dose,
  group_col = "dose_group",
  response = "proportion",
  mf_type = "min",
  output_path = NULL
)
Control Dose
## $Control
96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth.  Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth. Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

Low Dose
## $Low
96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth.  Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth. Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

Medium Dose
## $Medium
96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth.  Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth. Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

High Dose
## $High
96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth.  Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

96-base trinucleotide spectra (minimum) per Dose. Bars are the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth. Bars are coloured based on SNV subtype. Data labels indicate the total number of mutations for each SNV subtype.

Heatmaps

Another option for vizualizing the base-96 mutation spectra is plot_trinucleotide_heatmap(). This function creates a heatmap of the 96-base SNV proportions. Plots can be facetted by additional grouping variables. Heatmaps are useful for making comparisons between experimental variables when information density becomes too high to represent using traditional plots.

Example 6.10. Plot the 96-base SNV spectrum for each sample, facetted by dose group.

# Calculate the mf data at the 96-base resolution for each sample
mf_data_96 <- calculate_mf(
  mutation_data = example_data,
  cols_to_group = "sample",
  subtype_resolution = "base_96",
  variant_types = "snv",
  retain_metadata_cols = "dose_group"
)
mf_data_96$dose_group <- factor(
  mf_data_96$dose_group,
  levels = c("Control", "Low", "Medium", "High")
)
# Plot
plot <- plot_trinucleotide_heatmap(
  mf_data = mf_data_96,
  group_col = "sample",
  facet_col = "dose_group"
)
plot
96-base trinucleotide spectra (minimum) per Sample, facetted by Dose. Colour represents the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth.

96-base trinucleotide spectra (minimum) per Sample, facetted by Dose. Colour represents the proportion of SNV subtypes within their trinucleotide context normalized to the sequencing depth.

Bubble Plots

plot_bubbles is used to visually represent the distribution and density of recurrent mutations. Each mutation is in a given group is represented by a bubble whose size is scaled on either the alt_depth or the vaf. Thus a highly reccurent mutation is represented by a large bubble. These plots make it easy to determine if MFmax is driven by a few highly recurrent mutations versus serveral moderately recurrent mutations.

Plots can be facetted by user-defined groups, and bubbles can be coloured by any variable of interest to help discern patterns in mutation recurrence.

Example 7. Plot mutations per dose group, bubbles coloured by base-6 subtype

plot <- plot_bubbles(
  mutation_data = example_data,
  size_by = "alt_depth",
  facet_col = "dose_group",
  color_by = "normalized_subtype"
)
plot
Multiplet mutations plotted per Dose. Each circle represents a mutation, coloured by mutation subtype. The size of the circle is scaled by the mutation's alternative depth.

Multiplet mutations plotted per Dose. Each circle represents a mutation, coloured by mutation subtype. The size of the circle is scaled by the mutation’s alternative depth.

Retrieve Sequences

get_seq() will retrive raw nucleotide sequences for specified genomic intervals. This function will install an appropriate BS genome library to retrieve sequences based on species, genome, and masked parameter.

Supply regions with a filepath, data frame or GRanges object containing the specified genomic intervals. TwinStrand’s Mutagenesis Panels are stored in package files and can easily be retrieved.

Sequences are returned within a GRanges object.

Example 8.1. Retrieve the sequences for our example’s target panel, TwinStrand’s Mouse Mutagenesis Panel

regions_seq <- get_seq(regions = "TSpanel_mouse")
regions_seq
## GRanges object with 20 ranges and 6 metadata columns:
##        seqnames              ranges strand | target_size       label
##           <Rle>           <IRanges>  <Rle> |   <integer> <character>
##    [1]     chr1   69304218-69306617      * |        2400        chr1
##    [2]     chr1 155235939-155238338      * |        2400      chr1.2
##    [3]     chr2   50833176-50835575      * |        2400        chr2
##    [4]     chr3 109633161-109635560      * |        2400        chr3
##    [5]     chr4   96825281-96827680      * |        2400        chr4
##    ...      ...                 ...    ... .         ...         ...
##   [16]    chr15   66779763-66782162      * |        2400       chr15
##   [17]    chr16   72381581-72383980      * |        2400       chr16
##   [18]    chr17   94009029-94011428      * |        2400       chr17
##   [19]    chr18   81262079-81264478      * |        2400       chr18
##   [20]    chr19     4618814-4621213      * |        2400       chr19
##        genic_context region_GC_content      genome                sequence
##          <character>         <numeric> <character>          <DNAStringSet>
##    [1]    intergenic              37.3        mm10 CAATCTTTCT...CAAAATGCAA
##    [2]         genic              54.0        mm10 AATCTCCAGG...CAAGCACTGG
##    [3]    intergenic              45.3        mm10 TGTGCCCCAT...TGCTTGCCAC
##    [4]         genic              39.2        mm10 AACGATGAAT...GCACTCAAGA
##    [5]    intergenic              39.4        mm10 ATTGTTTGAA...CTCAGGGCCT
##    ...           ...               ...         ...                     ...
##   [16]         genic              44.0        mm10 GTGTCATTTT...CAGGTAGAGG
##   [17]    intergenic              38.3        mm10 TCTGTAGCAA...ATAAAAACTC
##   [18]    intergenic              35.2        mm10 TAAGGAAACT...TATCAAAGAT
##   [19]    intergenic              47.3        mm10 AGCCATCTCC...GGGACTCAGA
##   [20]         genic              56.1        mm10 TCCAGGCTGT...GCCCATGGAG
##   -------
##   seqinfo: 19 sequences from an unspecified genome; no seqlengths

Example 8.2. Retrieve sequences for a custom interval of regions. We will use the Human Mutagenesis Panel as an example.

# We will load the TSpanel_human regions file as an example
human <- load_regions_file("TSpanel_human")
regions_seq <- get_seq(regions = human,
                       is_0_based_rg = FALSE,
                       species = "human",
                       genome = "hg38",
                       masked = FALSE,
                       padding = 0)
regions_seq
## GRanges object with 20 ranges and 7 metadata columns:
##        seqnames              ranges strand | target_size description
##           <Rle>           <IRanges>  <Rle> |   <integer> <character>
##    [1]    chr11 108510788-108513187      * |        2400 region_1111
##    [2]    chr13   75803913-75806312      * |        2400 region_1501
##    [3]    chr14   74661756-74664155      * |        2400 region_1725
##    [4]    chr18     5749265-5751664      * |        2400 region_2457
##    [5]     chr2   40162768-40165167      * |        2400 region_2896
##    ...      ...                 ...    ... .         ...         ...
##   [16]    chr15   46089738-46092137      * |        2400 region_1904
##   [17]    chr17   70672727-70675126      * |        2400 region_2378
##   [18]    chr21   23665977-23668376      * |        2400 region_3515
##   [19]    chr22   48262371-48264770      * |        2400 region_3703
##   [20]    chr10 128969038-128971437      * |        2400  region_784
##        genic_context        gene      genome       label
##          <character> <character> <character> <character>
##    [1]         genic       EXPH5        hg38       chr11
##    [2]         genic        LMO7        hg38       chr13
##    [3]         genic       AREL1        hg38       chr14
##    [4]         genic   MIR3976HG        hg38       chr18
##    [5]         genic  SLC8A1-AS1        hg38        chr2
##    ...           ...         ...         ...         ...
##   [16]    intergenic        <NA>        hg38       chr15
##   [17]    intergenic        <NA>        hg38       chr17
##   [18]    intergenic        <NA>        hg38       chr21
##   [19]    intergenic        <NA>        hg38       chr22
##   [20]    intergenic        <NA>        hg38       chr10
##                       sequence
##                 <DNAStringSet>
##    [1] GTTTCCTTCA...CTTTCCTGGA
##    [2] AGAATTATTT...TCAGACAACC
##    [3] TTCCCTGGTT...AAGATACACT
##    [4] TGGCAACTTG...AATGAAAACA
##    [5] CTAGATTTTC...AGCATATCAC
##    ...                     ...
##   [16] TGAACAGACA...ATAAATTGCT
##   [17] GTGGTGATCA...TAAAGATTCT
##   [18] TGTAATAATG...TCCAGTCATT
##   [19] GTAAAGGCAG...TCCACAGCAG
##   [20] GTGGACTGAT...TTCTCACACT
##   -------
##   seqinfo: 20 sequences from an unspecified genome; no seqlengths

Sequences can be exported as FASTA files with write_reference_fasta(). Supply this function with the GRanges object containing the sequences of the regions. Each one will be written to a single FASTA file. The FASTA file will be saved to the output_path. If NULL, the file will be saved to the working directory.

write_reference_fasta(regions_seq, output_path = NULL)

Exporting Results

Users can easily output data frames to an Excel workbook with write_excel(). This function can write single data frames or it can take a list of dataframes and write each one to a separate Excel sheet in a workbook.

In addition to data frames, write_excel() will also extract the mf_data, point_estimates, and pairwise_comparisons from model_mf() output to write to an excel workbook. Set model_results to TRUE if supplying the function with the output to model_mf().

Example 9.1. Write MF data to excel workbook.

# save a single data frame to an Excel file
write_excel(mf_data_global, workbook_name = "example_mf_data")

# Write multiple data frames to a list to export all at once.
list <- list(mf_data_global, mf_data_rg, mf_data_6)
names(list) <- c("mf_per_sample", "mf_per_region", "pf_6spectra")

#save a list of data frames to an Excel file
write_excel(list, workbook_name = "example_mf_data_list")

Example 9.2. Export model results

write_excel(
  model_by_dose,
  workbook_name = "Example_model",
  model_results = TRUE
)

Mutation data can be written to a VCF file for downstream applications with write_vcf_from_mut().

write_vcf_from_mut(example_data)

Summary Report

To facilitate reproducible analysis of error-corrected sequencing data, the MutSeqR package includes a pre-built R Markdown workflow: Summary_Report.Rmd. This report template allows users to efficiently evaluate the effects of an experimental variable (e.g., dose, age, tissue) on mutation frequency and mutational spectra.

The workflow guides users through a standardized analysis pipeline that includes:

  • Mutation data Import
  • Mutation data Filtering
  • Calculation of Global Mutation Frequencies per Sample
  • Visualization of mutation frequencies per Sample and Mean mutation frequencies per experimental group.
  • Generalized Linear Modeling of Mutation Frequency (min) as an effect of the experimental variable.
  • Optional Benchmark Dose Modeling of mutation frequency across a dose or concentration variable.
  • Calculation of Mutation Subtype Frequencies and Proportions at the base-6 and base-96 resolutions.
  • Comparison of base-6 mutation spectra between experimental groups.
  • Visualization of base-6 subtype proportions per sample with hierarchical clustering
  • Visualization of base-6 subtype proportions per experimental group.
  • Visualization of base-96 subtype proportions per experimental group.
  • Mutation signature analysis per experimental group.
  • Generalized Linear Mixed Modeling of mutation frequency as an effect of specified genomic regions and the experimental group.
  • Visualization of multiplet mutations per experimental group.

All figures will be included in the report. Data frames included in the report will also be exported to excel workbooks within a predefined output path.

General Usage

Users must download the config file “summary_config.yaml”. This file defines the parameters to be passed to the Summary_Report.rmd. Users must fill out the .yaml file and save it.

config <- system.file("extdata", "inputs", "summary_config.yaml", package = "MutSeqR")
file.copy(from = config, to = "path/to/summary_config.yaml")

Users can then render the report using MutSeqR’s render_report(). Provide the function with the .yaml file using the config_file_path parameter. Provide the name and the file path of the output file to the output_file parameter. Set the output_format to one of “html_document” (recommended), “pdf_document”, or “all”.

render_report(config_file_path = "path/to/summary_config.yaml",
              output_file = "path/to/output_file.html",
              output_format = "html_document")

Parameters

Below is a description of the parameters set in summary_config.yaml. Parameters are catagerised by their function.

System Set Up

  • projectdir: The working directory for the project. All filepaths will be imported from the projectdir. If NULL, the current working directory will be used.
  • outputdir: The output directory where output files will be saved. If NULL, the projectdir will be used.

Project Details

  • project_title: The project title.
  • research_name: The name of the principle investigator for the project.
  • user_name: The name of the one running the analysis.
  • project_description: Optional description of the project.

Workflow Profile

  • config_profile : Choose a workflow profile to load pre-defined parameters. Current options are “Duplex Sequencing Mouse Mutagenesis Panel”, “Duplex Sequencing Human Mutagenesis Panel”, “Duplex Sequencing Rat Mutagenesis Panel”, and “None”. Pre-defined parameters are set to handle data import and filtering. More profiles coming soon. If not using a pre-defined profile (“None”), users must fill out the Custom_Profile_Params (see below).

Data Import

  • species: The species of the model organism. This can be the common name or the scientific name. Ex. “Mouse” or “Mus musculus”.
  • genome: The reference genome. This is used to populate the sequence context column. Ex. “mm10”.
  • file_type: The format of the mutation data to be imported. Options are table or vcf.
  • mutation_file: The file path to the mutation data file(s). This can be an individual file or a directory containing multiple files. The file path will be read as projectdir/mutation_file.
  • mut_sep: The delimiter for importing tabular mutation data. Ex. “ is tab-delimited.
  • sample_data: An optional file path to the sample metadata that will be joined with the mutation data. Set to NULL if not importing metadata. The file path will be read as projectdir/sample_data.
  • sd_sep: The delimiter for importing the sample_data file. Ex. “ is tab-delimited.

Calculating MF

Precalculated-depth files: The Summary_Report will automatically check during import whether the mutation data contains a depth column (total_depth or depth). If a depth column does not exist, per-sample precalculated depth can be provided for any subtype resolution to calculate mutation frequencies. If an exp_variable is provided, the Summary_Report will automatically sum the per-sample depths to obtain the depth per experimental group as needed.

In order for MutSeqR to calculate the depth from the mutation data, the data must have the depth-value for every sequenced site. It is not sufficient to calculate the depth from a list of variants. No-variant sites must be included. The Summary Report will check for a total_depth column, but it will not check whether no_variant sites are included. If it finds a total_depth column, it will proceed to calculate the depth from the mutation data.

  • precalc_depth_data_global: Optional file path to the precalculated per-sample total_depth data. This is the total number of bases sequenced per sample, used for calculating mutation frequencies. Columns are “sample” and “group_depth”. If using an exp_variable (see below), please also include it in this table. The file path will be read as projectdir/precalc_depth_data_global.
  • precalc_depth_data_base6: Optional file path to the precalculated per-sample total_depth data in the base_6 context. This is the total number of C and T bases sequenced for each sample. Columns are “sample”, “normalized_ref”, and “subtype_depth”. If using an exp_variable (see below), please also include it in this table. The file path will be read as projectdir/precalc_depth_data_base6.
  • precalc_depth_data_base96: Optional file path to the precalculated per-sample total_depth data in the base_96 context. This is the total number bases sequenced per sample for each of the 32 possible trinucleotide contexts in their pyrimidine notation. Columns are “sample”, “normalized_context”, and “subtype_depth”. If using an exp_variable, please also include it in this table. The file path will be read as projectdir/precalc_depth_data_base96.
  • precalc_depth_data_rg: Optional file path to the precalculated per-sample total_depth data for each target region. This is the total number of bases sequenced per sample for each region. Columns are “sample”, ‘region_col’, “group_depth”. Only applicable if performing regions analysis. The file path will be read as projectdir/precalc_depth_data_rg.
  • d_sep: The delimiter for importing the precalc_depth_data files.

Statistical Analysis

The Summary Report will run several analyses to investigate the effect of a specified experimental variable on mutation frequency and spectra.

  • exp_variable: Optional. The experimental variable of interest. This argument should refer to a column in your mutation data or sample_data. The Summary Report does not support the analysis of multiple exp_variables; supply only one column name. Ex. “dose”. If NULL, statistical analyses will be skipped.
  • exp_variable_order: A vector specifying the unique levels of the exp_variable in the order desired for plotting.
  • reference_level: The reference level of the exp_variable. Ex. the vehicle control group for a chemical dose. This value must match one of the levels within the exp_variable column.
  • contrasts: An optional file path to the contrasts table specifying pairwise comparisons between levels of the experimental variable. Required 2 columns, no header. This contrasts table will be used for Generalized Linear Modeling of MFmin by the exp_variable (model_mf), Generalized Linear Mixed Modeling of MFmin by the exp_variable and target regions, if using a targeted panel (model_mf), and comparing the base_6 mutation spectra between exp_variable levels (spectra_comparison). The file path will be read as projectdir/contrasts. If NULL, spectra_comparison will be skipped, and the GLM/GLMM will return only the model estimates per exp_variable.
  • cont_sep: the delimiter for importing the contrasts file.
  • bmd: A logical variable indicating whether to run a BMD analysis on the MF data. If TRUE, the exp_variable must refer to numeric dose or concentration values.
  • bmr: A numeric value indicating the benchmark response for the bmd analysis. The Summary Report defines the bmr as a bmr-% increase in MF from the reference level. Default is 0.5.
  • bmd_method: The method for running the BMD analysis. Optionsare “proast” or “toxicr”.
  • run_sigfitting: A logical variable indicating whether to run signature_fitting. If an exp_variable is provided,the signature analysis will be performed per exp_variable level, otherwise it will be performed per sample. If TRUE, this will create a virtual environment to run python on first use.
  • python_version: The python version installed on the operating system.

Custom Profile Parameters

Users who do not wish to use one of the pre-defined parameter profiles may still run the Summary Report by filling out the Custom_Profile_Params. If using one of the config_profiles, users should skip this section.

  • ecs_technology: The technology used to generate the data.
  • is_0_based_mut: A logical variable indicating whether the genomic coordinates of the tabular mutation data are 0-based (TRUE), or 1-based (FALSE).
  • regions: An optional file path to the regions metadata file. The file must contain columns “contig”, “start”, “end” in addition to the metadata.
  • is_0_based_rg: A logical variable indicating whether the genomic coordinates of the regions file are 0-based (TRUE), or 1-based (FALSE).
  • rg_sep: The delimiter for importing the regions file.
  • vaf_cutoff: A numeric value from 0-1. Will flag variants with a VAF > vaf_cutoff as germline variants and filter them from mutation counts.
  • snv_in_germ_mnv: A logical variable indicating whether filter_mut should flag SNVs that overlap with germline MNVs for filtering.
  • rm_abnormal_vaf: A logical variable indicaring whether filter_mut should remove records where the VAF is between 0.05-0.45 or between 0.55-0.95.
  • custom_filter_col: A column name used by filter_mut to apply a custom filter to the specified column.
  • custom_filter_val: The value in custom_filter_col by which to filter.
  • custom_filter_rm: A logical variable indicating whether records that contain the custom_filter_val within the custom_filter_col should be removed (TRUE) or flagged (FALSE).
  • filtering_regions: Optional file path to regions file by which to filter variants. Must contain the contig, start, and end of each region. The file path will be read as projectdir/filtering_regions.
  • filtering_rg_sep: The delimiter for importing the filtering_regions file.
  • regions_filter: “keep_within” will remove any records that fall outside of the filtering_regions. “remove_within” will remove records that fall inside of the filtering_regions.
  • allow_half_overlap: A logical variable indicating whether to include records that half overlap with the regions. If FALSE, the start and end position of the record must fall within the region interval to be counted as “falling in the region”. If TRUE, records that start/end within the region interval, but extend outside of it will be counted as “falling inside the region”.
  • filtering_is_0_based_rg: A logical variable indicating whether the genomic coordinates of the filtering_regions are 0-based (TRUE) or 1-based (FALSE).
  • rm_filtered_mut_from_depth: A logical variable indicating whether the alt_depth of variants flagged during the filtering process should be removed from their total_depth values. This does not apply to records flagged as germline variants.
  • do_regions_analysis: A logical variable indicating whether to perform Generalized Linear Mixed Modeling of the data by sequencing target. This is applicable to data sets that used a targeted panel or has specific regions of interest.
  • region_col: The column name that uniquely identifies each target region for the regions_analysis. This column must be present in the mutation_data or in the regions metadata file.

Output

References

Appendix

Session Info

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5  BSgenome.Mmusculus.UCSC.mm10_1.4.3
##  [3] BSgenome_1.76.0                    rtracklayer_1.68.0                
##  [5] BiocIO_1.18.0                      Biostrings_2.76.0                 
##  [7] XVector_0.48.0                     GenomicRanges_1.60.0              
##  [9] GenomeInfoDb_1.44.0                IRanges_2.42.0                    
## [11] S4Vectors_0.46.0                   BiocGenerics_0.54.0               
## [13] generics_0.1.4                     MutSeqR_0.99.0                    
## [15] htmltools_0.5.8.1                  DT_0.33                           
## [17] BiocStyle_2.36.0                  
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          jsonlite_2.0.0             
##   [3] magrittr_2.0.3              GenomicFeatures_1.60.0     
##   [5] farver_2.1.2                nloptr_2.2.1               
##   [7] rmarkdown_2.29              fs_1.6.6                   
##   [9] ragg_1.4.0                  vctrs_0.6.5                
##  [11] memoise_2.0.1               minqa_1.2.8                
##  [13] Rsamtools_2.24.0            RCurl_1.98-1.17            
##  [15] S4Arrays_1.8.1              curl_6.4.0                 
##  [17] broom_1.0.8                 SparseArray_1.8.0          
##  [19] Formula_1.2-5               sass_0.4.10                
##  [21] bslib_0.9.0                 htmlwidgets_1.6.4          
##  [23] desc_1.4.3                  cachem_1.1.0               
##  [25] GenomicAlignments_1.44.0    lifecycle_1.0.4            
##  [27] pkgconfig_2.0.3             Matrix_1.7-3               
##  [29] R6_2.6.1                    fastmap_1.2.0              
##  [31] GenomeInfoDbData_1.2.14     rbibutils_2.3              
##  [33] MatrixGenerics_1.20.0       digest_0.6.37              
##  [35] colorspace_2.1-1            patchwork_1.3.0            
##  [37] AnnotationDbi_1.70.0        rprojroot_2.0.4            
##  [39] textshaping_1.0.1           crosstalk_1.2.1            
##  [41] RSQLite_2.4.1               labeling_0.4.3             
##  [43] httr_1.4.7                  abind_1.4-8                
##  [45] compiler_4.5.1              microbenchmark_1.5.0       
##  [47] here_1.0.1                  bit64_4.6.0-1              
##  [49] withr_3.0.2                 backports_1.5.0            
##  [51] BiocParallel_1.42.1         carData_3.0-5              
##  [53] DBI_1.2.3                   MASS_7.3-65                
##  [55] DelayedArray_0.34.1         rjson_0.2.23               
##  [57] tools_4.5.1                 glue_1.8.0                 
##  [59] restfulr_0.0.15             nlme_3.1-168               
##  [61] grid_4.5.1                  checkmate_2.3.2            
##  [63] gtable_0.3.6                tidyr_1.3.1                
##  [65] data.table_1.17.6           doBy_4.6.27                
##  [67] car_3.1-3                   Deriv_4.1.6                
##  [69] pillar_1.10.2               stringr_1.5.1              
##  [71] splines_4.5.1               dplyr_1.1.4                
##  [73] lattice_0.22-7              bit_4.6.0                  
##  [75] tidyselect_1.2.1            knitr_1.50                 
##  [77] reformulas_0.4.1            SummarizedExperiment_1.38.1
##  [79] xfun_0.52                   Biobase_2.68.0             
##  [81] matrixStats_1.5.0           stringi_1.8.7              
##  [83] UCSC.utils_1.4.0            yaml_2.3.10                
##  [85] boot_1.3-31                 evaluate_1.0.4             
##  [87] codetools_0.2-20            tibble_3.3.0               
##  [89] BiocManager_1.30.26         packcircles_0.3.7          
##  [91] cli_3.6.5                   systemfonts_1.2.3          
##  [93] Rdpack_2.6.4                jquerylib_0.1.4            
##  [95] dichromat_2.0-0.1           modelr_0.1.11              
##  [97] Rcpp_1.0.14                 png_0.1-8                  
##  [99] XML_3.99-0.18               parallel_4.5.1             
## [101] pkgdown_2.1.3               ggh4x_0.3.1                
## [103] ggplot2_3.5.2               blob_1.2.4                 
## [105] dendsort_0.3.4              plyranges_1.28.0           
## [107] bitops_1.0-9                lme4_1.1-37                
## [109] viridisLite_0.4.2           VariantAnnotation_1.54.1   
## [111] scales_1.4.0                purrr_1.0.4                
## [113] crayon_1.5.3                rlang_1.1.6                
## [115] cowplot_1.1.3               KEGGREST_1.48.1
Bergstrom, Erik N., Mi Ni Huang, Uma Mahto, Mark Barnes, Michael R. Stratton, Steven G. Rozen, and Ludmil B. Alexandrov. 2019. SigProfilerMatrixGenerator: A Tool for Visualizing and Exploring Patterns of Small Mutational Events.” BMC Genomics 20 (1): 685. https://doi.org/10.1186/s12864-019-6041-2.
Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 2011. Handbook of Markov Chain Monte Carlo. CRC Press.
Committee, EFSA Scientific, Simon John More, Vasileios Bampidis, Diane Benford, Claude Bragard, Thorhallur Ingi Halldorsson, Antonio F Hernández-Jerez, et al. 2022. “Guidance on the Use of the Benchmark Dose Approach in Risk Assessment.” EFSA Journal 20 (10): e07584. https://doi.org/10.2903/j.efsa.2022.7584.
Danecek, Petr, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, et al. 2011. “The Variant Call Format and VCFtools.” Bioinformatics 27 (15): 2156–58. https://doi.org/10.1093/bioinformatics/btr330.
Díaz-Gay, Marcos, Raviteja Vangara, Mark Barnes, Xi Wang, S M Ashiqul Islam, Ian Vermes, Stephen Duke, et al. 2023. “Assigning Mutational Signatures to Individual Samples and Individual Somatic Mutations with SigProfilerAssignment.” Edited by Christina Kendziorski. Bioinformatics 39 (12): btad756. https://doi.org/10.1093/bioinformatics/btad756.
Dodge, Annette E., Danielle P. M. LeBlanc, Gu Zhou, Andrew Williams, Matthew J. Meier, Phu Van, Fang Yin Lo, et al. 2023. “Duplex Sequencing Provides Detailed Characterization of Mutation Frequencies and Spectra in the Bone Marrow of MutaMouse Males Exposed to Procarbazine Hydrochloride.” Archives of Toxicology 97 (8): 2245–59. https://doi.org/10.1007/s00204-023-03527-y.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 1995. Bayesian Data Analysis. New York: Chapman; Hall/CRC. https://doi.org/10.1201/9780429258411.
Halekoh, Ulrich, and Søren Højsgaard. 2025. doBy: Groupwise Statistics, LSmeans, Linear Estimates, Utilities. https://CRAN.R-project.org/package=doBy.
Kennedy, Scott R., Michael W. Schmitt, Edward J. Fox, Brendan F. Kohrn, Jesse J. Salk, Eun Hyun Ahn, Marc J. Prindle, et al. 2014. “Detecting Ultralow-Frequency Mutations by Duplex Sequencing.” Nature Protocols 9 (11): 2586–606. https://doi.org/10.1038/nprot.2014.170.
Khandekar, Azhar, Raviteja Vangara, Mark Barnes, Marcos Díaz-Gay, Ammal Abbasi, Erik N. Bergstrom, Christopher D. Steele, Nischalan Pillay, and Ludmil B. Alexandrov. 2023. “Visualizing and Exploring Patterns of Large Mutational Events with SigProfilerMatrixGenerator.” BMC Genomics 24 (1): 469. https://doi.org/10.1186/s12864-023-09584-y.
LeBlanc, Danielle P. M., Matthew Meier, Fang Yin Lo, Elizabeth Schmidt, Charles Valentine, Andrew Williams, Jesse J. Salk, Carole L. Yauk, and Francesco Marchetti. 2022. “Duplex Sequencing Identifies Genomic Features That Determine Susceptibility to Benzo(a)pyrene-Induced in Vivo Mutations.” BMC Genomics 23 (1): 542. https://doi.org/10.1186/s12864-022-08752-w.
Marchetti, Francesco, Renato Cardoso, Connie L. Chen, George R. Douglas, Joanne Elloway, Patricia A. Escobar, Tod Harper Jr, et al. 2023. “Error-Corrected Next-Generation Sequencing to Advance Nonclinical Genotoxicity and Carcinogenicity Testing.” Nature Reviews Drug Discovery 22 (3): 165–66. https://doi.org/10.1038/d41573-023-00014-y.
Marchetti, Francesco, Renato Cardoso, Connie L. Chen, George R. Douglas, Joanne Elloway, Patricia A. Escobar, Tod Harper, et al. 2023. “Error-Corrected Next Generation Sequencing - Promises and Challenges for Genotoxicity and Cancer Risk Assessment.” Mutation Research. Reviews in Mutation Research 792: 108466. https://doi.org/10.1016/j.mrrev.2023.108466.
Menon, Vijay, and Douglas E. Brash. 2023. “Next-Generation Sequencing Methodologies to Detect Low-Frequency Mutations: Catch Me If You Can’.” Mutation Research/Reviews in Mutation Research 792 (July): 108471. https://doi.org/10.1016/j.mrrev.2023.108471.
Piegorsch, W W, and A J Bailer. 1994. “Statistical Approaches for Analyzing Mutational Spectra: Some Recommendations for Categorical Data.” Genetics 136 (1): 403–16. https://doi.org/10.1093/genetics/136.1.403.
Wheeler, Matthew W., Todd Blessinger, Kan Shao, Bruce C. Allen, Louis Olszyk, J. Allen Davis, and Jeffrey S. Gift. 2020. “Quantitative Risk Assessment: Developing a Bayesian Approach to Dichotomous Dose-Response Uncertainty.” Risk Analysis: An Official Publication of the Society for Risk Analysis 40 (9): 1706–22. https://doi.org/10.1111/risa.13537.
Wheeler, Matthew W., Jose Cortinas, Marc Aerts, Jeffery S. Gift, and J. Allen Davis. 2022. “Continuous Model Averaging for Benchmark Dose Analysis: Averaging Over Distributional Forms.” Environmetrics 33 (5): e2728. https://doi.org/10.1002/env.2728.
White, Paul A., Alexandra S. Long, and George E. Johnson. 2020. “Quantitative Interpretation of Genetic Toxicity DoseResponse Data for Risk Assessment and Regulatory DecisionMaking: Current Status and Emerging Priorities.” Environmental and Molecular Mutagenesis 61 (1): 66–83. https://doi.org/10.1002/em.22351.