Weighted k-word matches: a sequence comparison tool for proteins

Authors: Jing, Junmei; Wilson, Susan; Burden, Conrad
Dates: 2011; 2015-12-07
ISSN: 1446-8735
URI: http://hdl.handle.net/1885/22183

The use of k-word matches was developed as a fast alignment-free comparison method for DNA sequences in cases where long-range contiguity has been compromised, for example by shuffling, duplication, deletion or inversion of extended blocks of sequence. Here we extend the algorithm to amino acid sequences. We define a new statistic, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids. We computed the mean and variance, and simulated the distribution function, for various forms of this statistic for sequences of independent and identically distributed letters. We present these results and a method for choosing an optimal word size. The efficiency of the method is tested using simulated evolutionary sequences, and the results are compared with BLAST.
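
To make the idea of a weighted word match concrete, the sketch below contrasts a plain k-word match count with one possible weighted variant. It is an illustration under stated assumptions, not the definition used in the paper: the abstract does not give the form of the statistic, so the product-of-letter-weights rule, the function names, and the toy amino acid groups in `TOY_GROUPS` are all hypothetical choices made here. With an identity weight the weighted count reduces to the exact count.

```python
from collections import Counter

# Hypothetical similarity groups, chosen only for illustration.
# They are NOT a substitution matrix taken from the paper.
TOY_GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG"]


def identity_weight(x, y):
    """Weight 1 for identical letters, 0 otherwise (exact matching)."""
    return 1.0 if x == y else 0.0


def toy_weight(x, y):
    """Assumed weight: 1 for identical letters, 0.5 within a toy group, else 0."""
    if x == y:
        return 1.0
    for group in TOY_GROUPS:
        if x in group and y in group:
            return 0.5
    return 0.0


def kword_matches(a, b, k):
    """Count pairs (i, j) where the k-word at a[i:] equals the k-word at b[j:]."""
    words_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    words_b = Counter(b[j:j + k] for j in range(len(b) - k + 1))
    return sum(n * words_b[w] for w, n in words_a.items())


def weighted_kword_matches(a, b, k, weight):
    """Sum, over all word pairs, the product of per-letter weights (assumed form)."""
    total = 0.0
    for i in range(len(a) - k + 1):
        for j in range(len(b) - k + 1):
            w = 1.0
            for offset in range(k):
                w *= weight(a[i + offset], b[j + offset])
                if w == 0.0:
                    break  # this word pair cannot contribute
            total += w
    return total


if __name__ == "__main__":
    seq_a, seq_b = "MKVLAD", "MRILAE"
    print(kword_matches(seq_a, seq_b, 2))
    print(weighted_kword_matches(seq_a, seq_b, 2, toy_weight))
```

The exact count uses word tallies and runs in time linear in the sequence lengths, whereas this naive weighted version compares every pair of positions; it is meant to show the quantity being computed, not an efficient way to compute it.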