
The distribution of word matches between Markovian sequences with periodic boundary conditions

Burden, Conrad J; Leopardi, Paul; Foret, Sylvain

Description

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied...
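As an illustration of the word-match count described above, here is a minimal Python sketch of a D2-style statistic. It assumes that a "word" is a run of k consecutive letters and that each sequence is read as a circle, so words wrap around the end. The word length `k`, the function names `cyclic_word_counts` and `d2`, and the choice to apply the periodic boundary to word counting are assumptions made for this example and are not taken from the record.

```python
from collections import Counter


def cyclic_word_counts(seq: str, k: int) -> Counter:
    """Count the k-letter words of seq, reading seq as a circle.

    Assumption for this sketch: with a periodic boundary, a sequence of
    length n contains exactly n words, one starting at each position.
    """
    if not 1 <= k <= len(seq):
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    # Append the first k-1 letters so that words starting near the end wrap around.
    extended = seq + seq[: k - 1]
    return Counter(extended[i : i + k] for i in range(len(seq)))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Number of exact k-word matches between two sequences.

    Every pair (position in seq_a, position in seq_b) whose words are
    identical contributes 1, so the total is the sum over words of
    count_a[word] * count_b[word].
    """
    counts_a = cyclic_word_counts(seq_a, k)
    counts_b = cyclic_word_counts(seq_b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Under these assumptions, a sequence and any cyclic rotation of it contain the same words, so `d2("ACG", "GAC", 2)` equals `d2("ACG", "ACG", 2)`. Grouping positions by word avoids comparing every pair of positions directly.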

dc.contributor.author: Burden, Conrad J
dc.contributor.author: Leopardi, Paul
dc.contributor.author: Foret, Sylvain
dc.date.accessioned: 2014-04-09T05:09:09Z
dc.date.available: 2014-04-09T05:09:09Z
dc.identifier.issn: 1066-5277
dc.identifier.uri: http://hdl.handle.net/1885/11552
dc.description.abstract: Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
dc.format: 23 pages
dc.publisher: Mary Ann Liebert
dc.rights: http://www.sherpa.ac.uk/romeo/issn/1066-5277/
dc.source: Journal of Computational Biology 21.1 (2014): 41-63
dc.source.uri: http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0277
dc.subject: Markov chains
dc.subject: sequence analysis
dc.subject: statistical models
dc.title: The distribution of word matches between Markovian sequences with periodic boundary conditions
dc.type: Journal article
local.identifier.citationvolume: 21
dc.date.issued: 2014
local.identifier.absfor: 010402 - Biostatistics
local.identifier.absfor: 060102 - Bioinformatics
local.identifier.ariespublication: f5625xPUB6238
local.publisher.url: http://www.liebertpub.com/
local.type.status: Published Version
local.contributor.affiliation: Burden, Conrad J, Mathematical Sciences Institute, Australian National University
local.contributor.affiliation: Leopardi, Paul, Mathematical Sciences Institute, Australian National University
local.contributor.affiliation: Foret, Sylvain, Research School of Biology, Australian National University
dc.relation: http://purl.org/au-research/grants/arc/dp120101422
local.bibliographicCitation.issue: 1
local.bibliographicCitation.startpage: 41
local.bibliographicCitation.lastpage: 63
local.identifier.doi: 10.1089/cmb.2012.0277
local.identifier.absseo: 970101 - Expanding Knowledge in the Mathematical Sciences
dc.date.updated: 2015-12-11T09:40:04Z
local.identifier.scopusID: 2-s2.0-84891589894
local.identifier.thomsonID: 000329163100003
Collections: ANU Research Publications

Download

File: Burden et al The distribution of word matches 2014.pdf
Size: 720.17 kB
Format: Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.

Updated: 17 November 2022 / Responsible Officer: University Librarian / Page Contact: Library Systems & Web Coordinator