Ly, Trong Nhan
2024-11-16
2024-11-16
https://hdl.handle.net/1885/733724780

Phylogenetic inference is fundamental in biology, elucidating evolutionary relationships among organisms and shedding light on species' origins. During the COVID-19 pandemic, phylogenetic methods have been pivotal in tracing the origin of SARS-CoV-2, detecting new variants, and guiding public health decisions. Phylogenetics also plays a key role in understanding Earth's biodiversity, spanning from the origins of life billions of years ago to the evolution of modern species such as plants and animals. We are now in the genomic era, where ever-growing genomic data makes existing phylogenetic tools intractable and creates an urgent need for new, efficient methods. To fill this gap, I developed several new phylogenetic methods tailored for large-scale genomic data. Firstly, I developed AliSim, a new phylogenetic sequence simulator that efficiently simulates genomic alignments under a wide range of realistic evolutionary models. To optimize performance across diverse simulation conditions, I devised an adaptive simulation approach that combines the rate matrix and probability matrix approaches. AliSim takes 1.4 hours and 1.3 GB of RAM to simulate alignments with one million sequences or sites, while popular software such as Seq-Gen, Dawg, and INDELible requires two to five hours and 50 to 500 GB of RAM. Secondly, I devised AliSim-HPC, the high-performance-computing version of AliSim. I implemented four parallel algorithms using OpenMP and the Message Passing Interface (MPI) to parallelize simulations across multiple cores and CPUs, achieving exceptional scalability. For example, AliSim-HPC simulates 100 large alignments (each with 30,000 sequences of one million sites) in 11 minutes using 256 CPU cores from a cluster of six computing nodes, a 153x speedup compared to the sequential version.
Thirdly, I introduced the CMAPLE (C++ MAximum Parsimonious Likelihood Estimation) package, which has two main components: (1) the CMAPLE software, which can reconstruct a phylogenetic tree from one million SARS-CoV-2 genomes, an input that existing maximum likelihood methods struggle to handle; and (2) the CMAPLE library, a suite of application programming interfaces (APIs) that facilitates the integration of the CMAPLE algorithm into existing phylogenetic inference software. As a demonstration, I integrated CMAPLE into the widely used IQ-TREE 2 software, which should ease its adoption by the research community. Lastly, I trained TreeFormer, an extension of the Phyloformer network, to predict the pairwise distances among 20 subtrees from their partial likelihoods. TreeFormer can function as a new tree rearrangement operation to improve the tree search process. Experimental results show that TreeFormer outperforms FastTree 2 in inferring trees from real alignments with fewer than 1,000 sites. Further research is needed to improve TreeFormer's accuracy. The outcomes of this thesis include three peer-reviewed articles (two published and one under review, minor revision), two new production-level phylogenetic software packages (AliSim and CMAPLE), and one proof-of-concept machine-learning model (TreeFormer). AliSim has been cited 38 times (on Google Scholar) and adopted in several phylogenetic studies since its publication in May 2022. CMAPLE offers researchers a powerful tool for large-scale pathogen analysis, whereas TreeFormer can be integrated into existing maximum likelihood methods to improve the tree search process. This thesis will aid in addressing the big data challenge in the genomic era.

en-AU
Phylogenomics in the pandemic era
2024
10.25911/XQV6-3Z60