An efficient algorithm is described for finding matches, repeats and other word relations, allowing for errors, in large data sets of long molecular sequences. The algorithm entails hashing on fixed-size words in conjunction with the use of a linked list connecting all occurrences of the same word. The average memory and run time requirement both increase almost linearly with the total sequence length. Some results of the program’s performance on a database of Escherichia coli DNA sequences are presented.
展开▼