Approximate word matches between two random sequences

Burden, Conrad J; Kantorovitz, Miriam R; Wilson, Susan R

Date

2008

Authors

Burden, Conrad J

Kantorovitz, Miriam R

Wilson, Susan R

Publisher

Institute of Mathematical Statistics

Abstract

Given two sequences over a finite alphabet L, the D₂ statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D₂ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For k<m, we look at the count of m-letter word matches with up to k mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.
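The statistic described in the abstract can be illustrated with a small brute-force sketch. It is not the authors' code. The function names `approx_word_matches` and `expected_matches` are hypothetical, and the sketch assumes non-overlapping boundary handling (words are taken only where they fit entirely inside each sequence). The expectation helper assumes i.i.d. letters in both sequences with the same letter probabilities, so that each aligned letter pair matches independently with probability p2, the sum of squared letter probabilities. The record does not state the paper's formula, so this binomial form should be read as an assumption that follows from that model.

```python
from math import comb


def approx_word_matches(a: str, b: str, m: int, k: int = 0) -> int:
    """Count pairs (i, j) whose m-letter words differ in at most k positions.

    With k = 0 this is the plain count of exact m-letter word matches.
    Brute force, O(len(a) * len(b) * m); meant for illustration only.
    """
    if not 0 <= k < m:
        raise ValueError("need 0 <= k < m")
    count = 0
    for i in range(len(a) - m + 1):
        word_a = a[i:i + m]
        for j in range(len(b) - m + 1):
            word_b = b[j:j + m]
            # Hamming distance between the two m-letter words.
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                count += 1
    return count


def expected_matches(len_a: int, len_b: int, m: int, k: int, letter_probs) -> float:
    """Expected count under the assumed i.i.d. letter model.

    Assumption (not taken from the record): the number of mismatches in one
    word pair is Binomial(m, 1 - p2), where p2 = sum of squared letter
    probabilities.
    """
    p2 = sum(p * p for p in letter_probs)
    # Probability that a single word pair has at most k mismatches.
    per_pair = sum(
        comb(m, t) * p2 ** (m - t) * (1 - p2) ** t for t in range(k + 1)
    )
    # Number of word positions in each sequence (zero if the sequence is too short).
    positions_a = max(len_a - m + 1, 0)
    positions_b = max(len_b - m + 1, 0)
    return positions_a * positions_b * per_pair
```

For example, `approx_word_matches("ACGT", "ACGA", 2, 0)` counts the two exact matches (AC and CG), while allowing one mismatch with `k=1` also admits the pair GT/GA and gives 3. The uniform letter distribution `[0.25, 0.25, 0.25, 0.25]` is one instance of strand-symmetric probabilities, although the expectation helper itself does not rely on strand symmetry.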

Keywords

DNA sequences; Sequence comparison; Word matches

URI

http://hdl.handle.net/1885/100208

Collections

ANU Research Publications

Source

The Annals of Applied Probability

Type

Journal article

Access Statement

Open Access

DOI

10.1214/07-AAP452

Downloads

File

01_Burden_Approximate_word_matches_2008.pdf (411.81 KB)

Description

Published Version
