首页> 外文会议>LATIN'98: Theoretical informatics >Spelling Approximate Repeated or Common Motifs Using a Suffix Tree
【24h】

Spelling Approximate Repeated or Common Motifs Using a Suffix Tree

机译:使用后缀树拼写近似的重复主题或共同主题

获取原文
获取原文并翻译 | 示例

摘要

We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet sigma. For instance, sigma may be equal to {A, C,G,T} and the sequence represents an encoding of a DNA macromolecule. The motifd searched correspond to words over th same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time. The second algorithm extracts common motifs from a set of N large than or equal to 2 sequences. In this case, the motifsmust occur, again with at most e mismatches, in 1 less than or equal to q less than or equal to N distinct sequences ofthe set. IN both cases, the words representing the motifs may never be present exactly i nthe sequences. We therefore speak of the motifs, repeatedin a sequence or common to a set of them, as being "external" objects and denote them by the expresion "valid models" if they verify the quorum constraint q. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further presprocessing, uses this tree to simply "spell" the modles. Assuming an alphabet of fixed size, the total time needed is O(nN sup 2v(e,k)) using O(nN sup 2/w) space, where n is the length of the sequence(s), k is the length of the models sought or is the length of the longest possible valid models, w is the size of a word machine and V(e,k) is the number of words of length k at a Hamming distance at most e from another k-length word. V(e,k) may be majored by k sup e sigam sup e. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the ocmmon motifs problem whenever N < k sigma, and a better space bound when N/w < k. It is a better time and space bound in absolute for the repeated motifs problem. The complexities obtained in htis second case are O(nV(e,k)) and O(n) respectively. Finally, we suggest how to extend these algorithms to deal with gaps.
机译:我们在本文中提出两种算法。第一个从字母sigma上定义的序列中提取重复的图案。例如,σ可以等于{A,C,G,T},并且该序列表示DNA大分子的编码。所搜索的图案对应于同一字母上的单词,该单词在序列中出现最少次数q次,每次最多e个不匹配。第二种算法从N个大于或等于2个序列的集合中提取共同的主题。在这种情况下,基序必须再次出现,最多具有e个错配,且小于或等于q小于或等于集合的N个不同序列的数量为1。在这两种情况下,代表基序的单词可能永远不会准确地出现在序列中。因此,我们称这些主题为“外部”对象,这些主题按顺序重复或与它们的集合共有,如果它们验证了法定约束q,则用表达式“有效模型”表示它们。我们在这里介绍的用于找到与重复的或常见的模体相对应的所有有效模型的方法始于构建序列的后缀树,然后在进行一些进一步的预处理之后,使用该树简单地“拼写”模型。假设字母大小固定,则使用O(nN sup 2 / w)空间所需的总时间为O(nN sup 2v(e,k)),其中n是序列的长度,k是长度所寻找模型的长度或是可能的最长有效模型的长度,w是字机器的大小,V(e,k)是距另一k长度汉明距离最多e的长度k的字数字。 V(e,k)可由k sup e sigam sup e主修。这对沃特曼[23]提出的算法进行了改进。每当N

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号