Elucidating the variability and transcriptional properties of human rDNA repeats using long-read sequencing technologies.

Date

2024

Authors

Weiss, Emiliana

Abstract

Background: The sequencing and assembly of the human genome have been instrumental in unravelling the regulation of gene expression and understanding the role of genome organisation. Yet significant gaps remain, particularly concerning ribosomal RNA (rRNA) genes (rDNA), which are organised into extensive repeat arrays and play dual roles: encoding the rRNAs essential for ribosome assembly and function, and maintaining genome stability. Exploration of rDNA arrays within the human genome has long been hampered by the inherent limitations of short-read sequencing technologies, which fail to capture the full complexity and linear continuity of these regions. This thesis confronts the challenges posed by the repetitive and variable nature of rDNA, especially the intergenic spacer (IGS) regions, which have historically been difficult to analyse because of their sequence diversity.

Methods: To better understand rDNA structure, we applied head-to-tail distance analysis to the HG002 sample, which revealed significant unit length variability and genetic diversity, including insertions and deletions (INDELs). We also polished Nanopore ultra-long reads with PacBio HiFi reads to improve base accuracy; this highlighted the difficulty of retaining the genetic variants that are crucial for accurate epigenetic analysis.

Results and Conclusions: This study advances the genomic analysis of rDNA by employing long-read sequencing technologies to examine rDNA arrays in their entirety, from promoter regions to the IGS. Our findings emphasise the need for meticulous rDNA annotation and for balance in sequencing so that variation essential for downstream applications is preserved. This includes understanding the implications of copy number variation (CNV) and sequence diversity, where sequencing technology and coverage significantly influence accuracy. We also demonstrated that low sequencing coverage and adaptive sampling can lead to underrepresentation of rDNA regions, affecting CNV estimates. The study further extends to the analysis of CpG methylation patterns within rDNA, uncovering three distinct methylation states (methylated, unmethylated, and semi-methylated), each with unique regulatory roles. Machine learning models showed that sequence content is predictive of methylation status, underscoring the intricate relationship between genetic sequence and epigenetic regulation. Finally, the study highlights the non-random nature of methylation within the rDNA locus by revealing distinct methylation patterns and associations between neighbouring rDNA units that share the same methylation type.

Significance and Future Directions: Overall, this study provides a deeper understanding of rDNA and establishes a foundation for further research by demonstrating the capability of long-read sequencing to advance genomic studies, particularly the analysis of complex and repetitive regions. This is a crucial step toward closing the knowledge gap surrounding specific, poorly characterised regions of the human genome. The work will be relevant to researchers studying diseases such as cancer and ribosomopathies, both of which result from abnormalities in ribosome function. Future studies could build on this thesis to facilitate the identification of specialised ribosomes, which may differ in their role in protein synthesis under normal and disease states.

Type

Thesis (PhD)

Restricted until

2025-05-28