Approximate word matches between two random sequences

Date

2008

Authors

Burden, Conrad J
Kantorovitz, Miriam R
Wilson, Susan R

Publisher

Institute of Mathematical Statistics

Abstract

Given two sequences over a finite alphabet L, the D₂ statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D₂ statistic in the context of DNA sequences, under the assumption of strand-symmetric Bernoulli text. For k < m, we look at the count of m-letter word matches with up to k mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance, and prove that its distribution is asymptotically normal.
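
As an illustration of the statistic described in the abstract, the following is a minimal brute-force sketch, not the authors' code. It assumes that every pair of positions (i, j) is counted, that overlapping words are included, and that k = 0 recovers the plain D₂ count. The function name count_word_matches and its parameters are hypothetical.

```python
def count_word_matches(seq_a: str, seq_b: str, m: int, k: int = 0) -> int:
    """Count position pairs (i, j) whose m-letter words differ in at most k letters.

    Assumption: all overlapping words in both sequences are compared,
    so k = 0 gives the number of exact m-letter word matches.
    """
    if m <= 0:
        raise ValueError("word length m must be positive")
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")

    count = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            word_b = seq_b[j:j + m]
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                count += 1
    return count


if __name__ == "__main__":
    print(count_word_matches("AAA", "AAT", m=2, k=0))
    print(count_word_matches("AAA", "AAT", m=2, k=1))
```

This sketch runs in time proportional to the product of the two sequence lengths times m, which is adequate for small examples only.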

Keywords

DNA sequences; Sequence comparison; Word matches

Source

The Annals of Applied Probability

Type

Journal article

Access Statement

Open Access
