Generalisable Methods for Improving CRISPR Efficiency and Outcome Specificity using Machine Learning Algorithms

Date

2021

Authors

O'Brien, Aidan

Abstract

CRISPR (clustered regularly interspaced short palindromic repeats)-based genome editing has become a popular tool across a range of disciplines, including microbiology, agricultural science, and health. Driving these applications is the ability of the "programmable" system to target a predefined location in the genome. A single guide RNA (sgRNA) defines the target through Watson-Crick base pairing, and a class 2 type II CRISPR-associated protein 9 (Cas9) nuclease cleaves the target, resulting in a double-strand break (DSB). The break activates DNA repair and, depending on the repair pathway initiated, can result in arbitrary insertions/deletions or a predefined variant. Despite the versatility and ease of design enabled by this RNA-guided nuclease, it falls short in specificity, in terms of off-target effects, and in efficiency, in terms of the rate of successful editing outcomes. The overarching hypothesis of my thesis is that these shortcomings of CRISPR systems can be addressed by using machine learning to train generalisable models on existing and novel datasets.
One pathway that demonstrates the need for prediction models is homology-directed repair (HDR). HDR enables researchers to induce nearly any editing outcome; however, it is inefficient. Moreover, given the incomplete knowledge of its kinetics, no models existed for predicting its efficiency. I generated a novel dataset representing the efficiency of HDR. Using the Random Forest algorithm, I identified that the sgRNA and the 3' region of the template modulate HDR efficiency. This novel finding relates to the kinetics of template interaction during HDR.
Even with efficient gene editing, a potential problem is unwanted side effects, such as embryonic lethality. This can be addressed by using CRISPR to create conditional knockout alleles, which control when and where knockouts occur. To investigate the efficiency of this process, I used statistical analyses and the Random Forest algorithm to analyse a dataset generated by a consortium of 19 laboratories. I identified the inherent inefficiency of this method, which is defined by the efficiency of two simultaneous HDR events. Other experimental variables, such as reagent concentrations or technician skill level, had no significant influence on efficiency. Because of the unrivalled versatility of this method, I created a statistical model for forecasting the efficiency of this technique from a small number of attempts, aiming to overcome its inherent inefficiency.
While Cas9 is the most cited CRISPR system, alternative CRISPR systems can further expand the gene editing repertoire. To support the uptake of the more recent Cas12a, I performed a comprehensive comparison between the two nucleases. I found support for Cas12a having superior specificity. Despite this, editing outcome and efficiency prediction tools for Cas12a were scarce. To address this, I trained a Cas12a cleavage efficiency prediction model on representative data. This model outperformed the current top model despite its training dataset being 300 times smaller, demonstrating the importance of clean data.
Altogether, this thesis improves the knowledge of different CRISPR gene editing techniques. These findings can enable researchers to design efficient experiments and provide guidance where certain techniques may be inherently inefficient. As well as resulting in CUNE (Computational Universal Nucleotide Editor) and Cas12aRF, the thesis identifies the generalisability of prediction models, owing to the high degree of influence of the sgRNA and repair template design on efficiency.

Type

Thesis (PhD)
