We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet sigma. For instance, sigma may be equal to {A, C,G,T} and the sequence represents an encoding of a DNA macromolecule. The motifd searched correspond to words over th same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time. The second algorithm extracts common motifs from a set of N large than or equal to 2 sequences. In this case, the motifsmust occur, again with at most e mismatches, in 1 less than or equal to q less than or equal to N distinct sequences ofthe set. IN both cases, the words representing the motifs may never be present exactly i nthe sequences. We therefore speak of the motifs, repeatedin a sequence or common to a set of them, as being "external" objects and denote them by the expresion "valid models" if they verify the quorum constraint q. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further presprocessing, uses this tree to simply "spell" the modles. Assuming an alphabet of fixed size, the total time needed is O(nN sup 2v(e,k)) using O(nN sup 2/w) space, where n is the length of the sequence(s), k is the length of the models sought or is the length of the longest possible valid models, w is the size of a word machine and V(e,k) is the number of words of length k at a Hamming distance at most e from another k-length word. V(e,k) may be majored by k sup e sigam sup e. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the ocmmon motifs problem whenever N < k sigma, and a better space bound when N/w < k. It is a better time and space bound in absolute for the repeated motifs problem. The complexities obtained in htis second case are O(nV(e,k)) and O(n) respectively. Finally, we suggest how to extend these algorithms to deal with gaps.
展开▼