Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing

dc.contributor.author: Sneddon, Alexandra [en]
dc.contributor.author: Mateos, Pablo Acera [en]
dc.contributor.author: Shirokikh, Nikolay E. [en]
dc.contributor.author: Eyras, Eduardo [en]
dc.date.accessioned: 2025-06-29T16:34:16Z
dc.date.available: 2025-06-29T16:34:16Z
dc.date.issued: 2022 [en]
dc.description.abstract: Algorithms developed for basecalling Nanopore signals have primarily focused on DNA to date and utilise the raw signal as the only input. However, it is known that messenger RNA (mRNA), which dominates Nanopore direct RNA (dRNA) sequencing libraries, contains specific nucleotide patterns that are implicitly encoded in the Nanopore signals since RNA is always sequenced from the 3' to 5' direction. In this study we present an approach to exploit the sequence biases in mRNA as an additional input to dRNA basecalling. We developed a probabilistic model of mRNA language and propose a modified CTC beam search decoding algorithm to conditionally incorporate the language model during basecalling. Our findings demonstrate that inclusion of mRNA language is able to guide CTC beam search decoding towards the more probable nucleotide sequence. We also propose a time efficient approach to decoding variable length nanopore signals. This work provides the first demonstration of the potential for biological language to inform Nanopore basecalling. Code is available at: https://github.com/comprna/radian. [en]
dc.description.sponsorship: This research was supported by the Australian Research Council (ARC) Discovery Project grants DP210102385 (to EE) and DP220101352 (to EE), by a grant from the Bootes Foundation (to NS and PAM), by an Australian Government Research Training Program (RTP) scholarship (to AS), and by the National Health and Medical Research Council (NHMRC) Investigator Grant GNT1175388 (to NS). [en]
dc.description.status: Peer-reviewed [en]
dc.format.extent: 16 [en]
dc.identifier.other: ORCID:/0000-0001-8249-358X/work/164351136 [en]
dc.identifier.other: ORCID:/0000-0003-0793-6218/work/164351540 [en]
dc.identifier.other: ORCID:/0000-0001-6199-7439/work/164354403 [en]
dc.identifier.scopus: 85164540045 [en]
dc.identifier.uri: http://www.scopus.com/inward/record.url?scp=85164540045&partnerID=8YFLogxK [en]
dc.identifier.uri: https://proceedings.mlr.press/v200/sneddon22a.html [en]
dc.identifier.uri: https://hdl.handle.net/1885/733765341
dc.language.iso: en [en]
dc.relation.ispartofseries: 17th Machine Learning in Computational Biology Meeting, MLCB 2022 [en]
dc.rights: Publisher Copyright: © MLCB 2022. [en]
dc.source: Proceedings of Machine Learning Research [en]
dc.title: Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing [en]
dc.type: Conference paper [en]
dspace.entity.type: Publication [en]
local.bibliographicCitation.lastpage: 165 [en]
local.bibliographicCitation.startpage: 150 [en]
local.contributor.affiliation: Sneddon, Alexandra; John Curtin School of Medical Research, ANU College of Science and Medicine, The Australian National University [en]
local.contributor.affiliation: Mateos, Pablo Acera; John Curtin School of Medical Research, ANU College of Science and Medicine, The Australian National University [en]
local.contributor.affiliation: Shirokikh, Nikolay E.; John Curtin School of Medical Research, ANU College of Science and Medicine, The Australian National University [en]
local.contributor.affiliation: Eyras, Eduardo; John Curtin School of Medical Research, ANU College of Science and Medicine, The Australian National University [en]
local.identifier.ariespublication: a383154xPUB42613 [en]
local.identifier.citationvolume: 200 [en]
local.identifier.doi: 10.1101/2022.10.19.512968 [en]
local.identifier.pure: 4c17e029-8861-4345-8670-6dccf68c1ae5 [en]
local.identifier.url: https://www.scopus.com/pages/publications/85164540045 [en]
local.identifier.url: https://proceedings.mlr.press/v200/sneddon22a.html [en]
local.type.status: Published [en]
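
The abstract describes a CTC beam search decoder that conditionally incorporates a language model. As a rough illustration of that general idea only, the following is a minimal sketch of CTC prefix beam search with an optional language-model factor. It is not the authors' implementation (see the linked repository for that). The names `ctc_beam_search`, `uniform_lm`, `prefer_u`, `lm_weight` and `conf_threshold` are hypothetical, and the rule used here for when to consult the language model (only when the signal model's top probability in a frame falls below a threshold) is an assumption made for the example, not the condition used in the paper.

```python
from collections import defaultdict

ALPHABET = "ACGU"        # RNA bases; column i of each frame scores ALPHABET[i]
BLANK = len(ALPHABET)    # the last column of each frame is the CTC blank


def uniform_lm(prefix, base):
    """Placeholder language model (assumption): every base is equally likely."""
    return 1.0 / len(ALPHABET)


def ctc_beam_search(probs, lm=uniform_lm, beam_width=4,
                    lm_weight=1.0, conf_threshold=0.8):
    """Decode per-frame probabilities with CTC prefix beam search.

    probs: list of frames, each a list of len(ALPHABET) + 1 probabilities.
    The language model is applied only in frames where the signal model is
    unsure (hypothetical rule: top probability below conf_threshold).
    """
    # Each prefix tracks (prob. of paths ending in blank, ending in a base).
    beams = {"": (1.0, 0.0)}
    for frame in probs:
        use_lm = max(frame) < conf_threshold
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            total = p_b + p_nb
            # Emitting a blank leaves the prefix unchanged.
            nxt[prefix][0] += total * frame[BLANK]
            for i, base in enumerate(ALPHABET):
                p = frame[i]
                # Score for emitting a genuinely new base, optionally LM-weighted.
                p_ext = p * lm(prefix, base) ** lm_weight if use_lm else p
                if prefix and prefix[-1] == base:
                    # Repeat with no blank in between collapses into the prefix.
                    nxt[prefix][1] += p_nb * p
                    # A new copy of the base is only possible after a blank.
                    nxt[prefix + base][1] += p_b * p_ext
                else:
                    nxt[prefix + base][1] += total * p_ext
        # Keep only the highest-scoring prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1],
                        reverse=True)
        beams = {prefix: tuple(score) for prefix, score in ranked[:beam_width]}
    return max(beams, key=lambda prefix: sum(beams[prefix]))


def prefer_u(prefix, base):
    """Toy language model (assumption): strongly favours U as the next base."""
    return 0.7 if base == "U" else 0.1


# One confident frame for A, then a frame where G and U are equally likely.
frames = [
    [0.9, 0.025, 0.025, 0.025, 0.025],
    [0.02, 0.02, 0.47, 0.47, 0.02],
]
print(ctc_beam_search(frames, lm=prefer_u))
```

In this toy input the signal alone cannot separate G from U in the second frame, so the language-model factor decides the call. In the first frame the signal is confident and the language model is not consulted.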
