Approximate word matches between two random sequences
Date
2008
Authors
Burden, Conrad J
Kantorovitz, Miriam R
Wilson, Susan R
Publisher
Institute of Mathematical Statistics
Abstract
Given two sequences over a finite alphabet L, the D₂ statistic is the
number of m-letter word matches between the two sequences. This statistic
is used in bioinformatics for expressed sequence tag database searches.
Here we study a generalization of the D₂ statistic in the context of DNA sequences,
under the assumption of strand symmetric Bernoulli text. For k<m,
we look at the count of m-letter word matches with up to k mismatches. For
this statistic, we compute the expectation, give upper and lower bounds for
the variance and prove its distribution is asymptotically normal.
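
As a rough illustration of the statistic described in the abstract, the following minimal sketch counts m-letter word pairs that differ in at most k positions (k = 0 gives the plain D₂ count). The function names `count_approx_matches` and `expected_approx_matches` are hypothetical and not taken from the paper. The expectation helper assumes only that both sequences are independent i.i.d. (Bernoulli) text with the same letter distribution, in which case the number of mismatches in a word pair is binomial and linearity of expectation applies; it is not claimed to be the form in which the authors state their result.

```python
from math import comb


def count_approx_matches(a: str, b: str, m: int, k: int = 0) -> int:
    """Count pairs (i, j) whose m-letter words a[i:i+m] and b[j:j+m]
    differ in at most k positions. Brute force, O(len(a) * len(b) * m)."""
    total = 0
    for i in range(len(a) - m + 1):
        for j in range(len(b) - m + 1):
            mismatches = sum(x != y for x, y in zip(a[i:i + m], b[j:j + m]))
            if mismatches <= k:
                total += 1
    return total


def expected_approx_matches(probs, n_a: int, n_b: int, m: int, k: int) -> float:
    """Expected count under the assumption (ours) of two independent i.i.d.
    sequences of lengths n_a and n_b with letter probabilities `probs`."""
    p2 = sum(p * p for p in probs)  # probability that two random letters agree
    # P(at most k mismatches among m independent letter comparisons)
    per_pair = sum(comb(m, t) * (1 - p2) ** t * p2 ** (m - t) for t in range(k + 1))
    n_pairs = max(0, n_a - m + 1) * max(0, n_b - m + 1)
    return n_pairs * per_pair


if __name__ == "__main__":
    print(count_approx_matches("AAAA", "AATA", m=2, k=0))
    print(count_approx_matches("AAAA", "AATA", m=2, k=1))
    print(expected_approx_matches([0.25] * 4, n_a=4, n_b=4, m=2, k=1))
```

The brute-force loop is only meant to make the definition concrete; it is quadratic in the sequence lengths and would be too slow for database-scale searches.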
Keywords
DNA sequences; Sequence comparison; Word matches
Source
The Annals of Applied Probability
Type
Journal article
Access Statement
Open Access