Eucalyptus pauciflora (E. pauciflora) is a long-lived tree that has the broadest altitudinal range of the ~900 eucalypt species. However, genomic resources for this species are limited. In this thesis, I assembled and assessed the chloroplast, mitochondrial and nuclear genomes of E. pauciflora with a suite of new methods. I then used these resources to tackle a set of key biological questions about genome evolution. These include the existence of heteroplasmy in organellar genomes, the intracellular transfer of DNA between genomes within individuals, and some surprising features of the chloroplast genome, not just within eucalypts but also across the angiosperm tree of life.
In Chapter I, I assembled the E. pauciflora chloroplast genome from long reads and short reads, which resolved a prevalent issue: assembling a complete chloroplast genome usually requires manual post-assembly processing. In addition, by comparing thousands of assemblies across a broad range of input data coverage, I was able to identify the minimum long-read and short-read coverage needed to generate a complete and accurate chloroplast genome assembly.
In Chapter II, I detailed a new method to detect and quantify chloroplast genome structural haplotypes from long-read sequencing data. With this method, I conducted a systematic analysis and quantification of chloroplast structural haplotypes in 61 land plant species across 19 orders of angiosperms, gymnosperms, and pteridophytes, and showed that most species contain just two structural haplotypes at equal frequency. I discuss how these results suggest that the formation of the haplotypes could be related to the large inverted regions in the chloroplast genome.
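As an illustration only, and not the method developed in the thesis, the idea of quantifying two structural haplotypes and checking whether they occur at equal frequency can be sketched as follows. The sketch assumes that each long read has already been assigned to one haplotype. The function names, the haplotype labels "A" and "B", and the use of an exact two-sided binomial test are all assumptions made for this example.

```python
from math import comb


def haplotype_frequencies(counts):
    """Convert per-haplotype read counts into relative frequencies.

    `counts` maps a (hypothetical) haplotype label to the number of
    long reads that support it.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads support any haplotype")
    return {name: n / total for name, n in counts.items()}


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k out of n trials. Intended for modest read
    counts (a few hundred); very large n would overflow float arithmetic.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so ties are not lost to rounding error.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Hypothetical read counts for two structural haplotypes.
    counts = {"A": 48, "B": 52}
    print(haplotype_frequencies(counts))
    print(binomial_two_sided_p(counts["A"], sum(counts.values())))
```

A large p-value here means the counts are compatible with equal frequency. It does not prove that the two haplotypes are equally abundant.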
In Chapter III, I extended my long-read-based method to test whether the chloroplast genome is circular or linear with well-defined end points. I first showed with simulated data that this method can detect a linear chloroplast genome with well-defined end points. On real data, however, the results showed a strong bias associated with sequencing technology: the defined-end signal is observed in almost all PacBio sequencing datasets, but not in the nanopore sequencing datasets. I discuss a range of possible explanations for these findings.
In Chapter IV, I de novo assembled the E. pauciflora and E. grandis mitochondrial genomes. Both consist of six possible structures. Through comparative structural analysis, I show that this structure most likely evolved independently in the lineages leading to E. pauciflora and E. grandis. My results imply that the published E. grandis mitochondrial genome, which is a single linear genome, may need to be revisited.
In Chapter V, I performed a study of single-nucleotide heteroplasmy in organellar genomes. Some studies have detected heteroplasmy in organellar genomes, but others have suggested that this apparent heteroplasmy could result from a combination of DNA transfer between organelles and contamination by other organellar genomes during DNA extraction and/or bioinformatic analysis. I used long-read data from E. pauciflora first to map all DNA transfer regions between genomes, and then to assess the evidence for heteroplasmy using only reads that are longer than the DNA transfer regions. I found limited evidence for the presence of heteroplasmy in organellar genomes, suggesting that previous results are likely due to contamination.
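The read-selection step described above can be sketched as a simple interval filter. This is a minimal illustration under stated assumptions, not the pipeline used in the thesis: read alignments and transfer regions are represented as 0-based half-open intervals on one reference genome, and the names `Interval`, `spans_with_flanks`, `is_informative` and the 1000 bp `min_flank` default are all hypothetical.

```python
from typing import Iterable, NamedTuple


class Interval(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def spans_with_flanks(read: Interval, region: Interval, min_flank: int) -> bool:
    """True if the read covers the whole region plus min_flank bases
    of unique sequence on each side."""
    return (read.start <= region.start - min_flank
            and read.end >= region.end + min_flank)


def is_informative(read: Interval,
                   transfer_regions: Iterable[Interval],
                   min_flank: int = 1000) -> bool:
    """Keep a read only if every transfer region it overlaps is fully
    spanned with flanks, so that the read's genome of origin can be
    anchored by sequence outside the transferred DNA."""
    for region in transfer_regions:
        overlaps = read.start < region.end and region.start < read.end
        if overlaps and not spans_with_flanks(read, region, min_flank):
            return False
    return True


if __name__ == "__main__":
    # Hypothetical transfer region and read alignments.
    regions = [Interval(5000, 7000)]
    reads = [Interval(3000, 9000), Interval(4500, 9000), Interval(5500, 6500)]
    for read in reads:
        print(read, is_informative(read, regions))
```

Reads that lie entirely inside a transfer region, or that extend only a short distance beyond it, are discarded because they cannot be assigned to a single genome with confidence.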
In Chapter VI, I de novo assembled the E. pauciflora nuclear genome and developed new metrics to compare different genome assemblies from the same data. I built a pipeline that estimates these metrics and a range of related metrics, allowing researchers to quickly and efficiently choose the best genome assembly for any non-model species.
In sum, this thesis broadens our understanding of organellar and nuclear genome assembly and provides some biological insights into organellar genomes.