Algorithms for estimating rates of nucleotide change

Date

2021

Authors

Tang, Yurong

Abstract

The rapidly falling cost of high-throughput sequencing makes it possible to acquire genome-wide data for estimating rates of nucleotide change from multiple sequence alignments, including both mutation rates within species and substitution rates between species. To address the problem of estimating mutation rates within species, Vogl (2014) developed a general algorithm for the case of bi-allelic neutral evolution. A necessary first step in generalising the Vogl estimator to the multi-allelic case is the generalisation of Wright's stationary beta distribution to higher dimensions. This involves finding a stationary solution to the multi-allelic forward Kolmogorov equation. We present an approximate analytic solution to the neutrally evolving multi-allelic forward Kolmogorov equation in the form of a set of line densities defined on the edges of the solution simplex for the general case of K alleles. The solution is obtained in terms of a parameterisation in which the rate matrix Q is decomposed into the sum of a time-reversible part and a non-reversible 'flux' part. The accuracy of the approximate solution for K = 3 and K = 4 alleles is illustrated using simulated data, and the agreement between simulation and theory is very close. Based on the approximate analytic solution, we address the problem of estimating a mutation rate matrix from site frequency data. The data are assumed to take the form of a multiple alignment of independent, neutrally evolving genomic sites sequenced from a moderate number of individuals chosen independently from a large effective population. We demonstrate that it is possible, in principle, to estimate an evolutionary rate matrix from the site frequency spectrum of an alignment of genomes sampled from a population, provided certain conditions are met. Nucleotide substitution rate matrices between species are generally used to calculate the likelihood of phylogenetic trees.
In phylogenetic reconstruction, the assumption of heterogeneity in the substitution process across lineages is supported by evidence of compositional heterogeneity between sequences. However, the total number of possible ways of reducing heterogeneity among lineages is enormous for even a modest number of taxa (Jayaswal et al., 2011). An efficient model selection strategy is therefore required to identify an 'optimal' model for a data set. In this study, we address these issues with a novel model selection algorithm that we term the Inherited Rate Matrix algorithm (hereafter IRM). This approach is based on the notion that a species inherits the substitution tendencies of its ancestor. We further present the non-stationary heterogeneous across lineages algorithm (hereafter ns-HAL), which extends the HAL (heterogeneity in the substitution process across lineages) algorithm of Jayaswal et al. (2014) to the general nucleotide Markov process. The IRM algorithm substantially reduces the complexity of identifying a sufficient solution to the problem of time-heterogeneous substitution processes across lineages. We also address the issue of computing time by developing a constrained-optimisation approach for the IRM algorithm (fast-IRM). From a simulation study based on 2nd codon position genome sequences of yeast, we establish that IRM is significantly more accurate than both ns-HAL and HAL for both close and dispersed sequences. IRM and fast-IRM are faster than ns-HAL, and fast-IRM also shows a marked speed improvement over a C++ implementation of HAL. A comparison of IRM with fast-IRM showed no difference in accuracy: the two algorithms made identical inferences for all data sets. Together they greatly reduce the compute time for model selection of a non-stationary process, widening the range of problems to which this important class of substitution models can be applied.
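
As an illustration of the rate matrix decomposition mentioned in the abstract, the sketch below splits a rate matrix Q into a time-reversible part and a 'flux' part by symmetrising the stationary probability flow pi_i * Q_ij. This is one standard construction, assumed here purely for illustration; it is not necessarily the parameterisation used in the thesis, and the function names and the example matrix are hypothetical.

```python
import numpy as np


def stationary_distribution(Q):
    """Solve pi @ Q = 0 subject to sum(pi) = 1 (Q is assumed irreducible)."""
    K = Q.shape[0]
    # Stack the K balance equations with the normalisation constraint.
    A = np.vstack([Q.T, np.ones(K)])
    b = np.zeros(K + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


def split_reversible_flux(Q):
    """Return (pi, Q_rev, Q_flux) with Q = Q_rev + Q_flux.

    Assumed construction (not taken from the thesis): the flow matrix
    F_ij = pi_i * Q_ij is split into symmetric and antisymmetric halves.
    """
    pi = stationary_distribution(Q)
    F = pi[:, None] * Q                     # stationary probability flow
    Q_rev = (F + F.T) / 2 / pi[:, None]     # satisfies detailed balance
    Q_flux = (F - F.T) / 2 / pi[:, None]    # net circulating flow
    return pi, Q_rev, Q_flux


# Hypothetical non-reversible rate matrix for K = 3 alleles (rows sum to zero).
Q_example = np.array([
    [-0.3,  0.2,  0.1],
    [ 0.1, -0.4,  0.3],
    [ 0.3,  0.1, -0.4],
])
pi, Q_rev, Q_flux = split_reversible_flux(Q_example)
```

Under this construction both parts share the stationary distribution pi, the reversible part satisfies detailed balance (pi_i * Q_rev_ij is symmetric), and the flux part vanishes exactly when Q is itself time-reversible.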

Type

Thesis (MPhil)
