Systematic bias in phylogenetic inference: Implications, Assessment, and Reduction
Abstract
Molecular phylogenetic inference is the process of reconstructing relationships between individuals, species, or higher groups from genomic sequence data. The reliability of phylogenetic analysis depends on the fit between the substitution models used and the evolutionary processes that generated the data. In phylogenetic inference, we commonly use substitution models which assume that sequence evolution is stationary, reversible, and homogeneous (SRH). Many empirical and simulation studies have shown that assuming SRH conditions can lead to significant errors in phylogenetic inference when the data violate these assumptions. Yet the extent of SRH violations and their effects on the inference of tree topologies are not well understood.
In Chapter I, I introduced and applied the maximal matched-pairs tests of homogeneity (MaxSym tests) to assess the scale and impact of SRH model violations on 3,572 partitions from 35 published phylogenetic datasets. I showed that roughly one-quarter of all the partitions I analysed reject the SRH assumptions and that, for more than one-quarter of datasets, tree topologies inferred from all partitions differ significantly from topologies inferred using only the subset of partitions that do not reject the SRH assumptions.
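As an illustration of the kind of calculation behind a matched-pairs test, the sketch below implements Bowker's test of symmetry for one pair of aligned nucleotide sequences, using only the Python standard library. It is a minimal sketch and not the implementation used in the thesis: the function names, the restriction to the four nucleotides, and in particular the choice to summarise an alignment by the smallest pairwise p-value are assumptions made here for illustration.

```python
from itertools import combinations
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Closed form for even degrees of freedom.
        return exp(-h) * sum(h ** i / gamma(i + 1) for i in range(df // 2))
    # Closed form for odd degrees of freedom.
    tail = sum(h ** (i - 0.5) / gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return erfc(sqrt(h)) + exp(-h) * tail


def bowker_symmetry(seq_a, seq_b, alphabet="ACGT"):
    """Bowker's matched-pairs test of symmetry for two aligned sequences.

    Returns (statistic, degrees of freedom, p-value). Sites where either
    sequence has a character outside `alphabet` (gaps, ambiguity codes)
    are skipped.
    """
    counts = {}
    for a, b in zip(seq_a, seq_b):
        if a in alphabet and b in alphabet:
            counts[(a, b)] = counts.get((a, b), 0) + 1
    stat, df = 0.0, 0
    for i, j in combinations(alphabet, 2):
        n_ij = counts.get((i, j), 0)
        n_ji = counts.get((j, i), 0)
        if n_ij + n_ji > 0:  # only informative cell pairs contribute
            stat += (n_ij - n_ji) ** 2 / (n_ij + n_ji)
            df += 1
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value


def smallest_pairwise_p(sequences):
    """ASSUMPTION: summarise an alignment by its smallest pairwise p-value."""
    return min(bowker_symmetry(a, b)[2] for a, b in combinations(sequences, 2))
```

A small p-value for a pair suggests that the two sequences did not evolve under the same stationary, reversible, and homogeneous process. Because many pairs are compared, any summary across pairs needs a correction for multiple testing, which this sketch does not attempt.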
In Chapter II, I simulated datasets under various degrees of non-SRH conditions, using empirically derived parameters to mimic real data, and examined the effects of incorrectly assuming SRH conditions when inferring phylogenies. I showed that maximum likelihood inference is generally quite robust to a wide range of SRH model violations but is inaccurate under extreme convergent evolution. In addition, I tested the power of the MaxSym tests and other popular tests to detect model violations due to non-SRH evolution. I showed that the MaxSym tests performed well under the different simulation schemes and that, of all the tests I studied, the MaxSym tests performed best at identifying datasets that might mislead phylogenetic inference.
In Chapter III, I investigated the ability of non-reversible models to estimate the root of a phylogeny. In addition, I introduced a new measure of support for the placement of the root in a phylogenetic tree, the rootstrap support. I tested the ability of non-reversible models to recover the root placement of five clades of mammals for which prior studies give very strong evidence of a particular root position. I showed that the non-reversible model correctly inferred the root of all five clades with very high rootstrap support. I then applied the same approach to infer the roots of two clades of mammals for which previous studies have repeatedly disagreed on the root position. I showed that non-reversible models recover roots similar to those of previous studies, but that the rootstrap support is lower than for the other five clades.
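One way to picture a bootstrap-based root support value is as the proportion of bootstrap trees that place the root on the same branch as the reference tree. The sketch below assumes that definition and a simplified representation in which each rooted tree is reduced to the set of taxa on one side of its root. Both the representation and the function names are hypothetical and are not taken from the thesis.

```python
def root_split(side, taxa):
    """Canonical form of a root position: the smaller side of the root split.

    `side` is the set of taxa on one side of the root and `taxa` is the full
    taxon set. Ties in size are broken alphabetically so that both ways of
    describing the same split map to the same value.
    """
    side = frozenset(side)
    other = frozenset(taxa) - side
    return min(side, other, key=lambda s: (len(s), sorted(s)))


def rootstrap_support(reference_side, bootstrap_sides, taxa):
    """Fraction of bootstrap trees whose root split matches the reference."""
    target = root_split(reference_side, taxa)
    hits = sum(root_split(s, taxa) == target for s in bootstrap_sides)
    return hits / len(bootstrap_sides)


# Hypothetical example: four taxa, reference root on the branch leading to A.
taxa = {"A", "B", "C", "D"}
replicates = [{"A"}, {"B", "C", "D"}, {"A", "B"}, {"A"}]
support = rootstrap_support({"A"}, replicates, taxa)
```

In this toy example three of the four replicates place the root on the branch leading to taxon A, so the support for that branch is 0.75.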
In Chapter IV, I investigated the homogeneity assumption widely used in phylogenetic inference. To check for homogeneity in empirical datasets, I introduced a computationally feasible test for homogeneity across lineages based on the AIC score. Using empirical datasets from three different clades of life, I tested the homogeneity assumption by estimating amino-acid substitution matrices for monophyletic sub-clades within each dataset. I showed that forcing the models to be homogeneous always provides a worse fit to the data than allowing each sub-clade to have its own model. In addition, for every dataset, I found that a simpler model in which two or more clades share the same substitution matrix is always better than the fully non-homogeneous model in terms of AIC score.
Together, these chapters show the impact of model violation due to non-SRH evolution on phylogenetic inference and suggest the need to test for model violation prior to phylogenetic inference, or to develop and apply more complex substitution models to relax some of the assumptions associated with the most widely used models in phylogenetics.