【24h】

Approximate Matching in Weighted Sequences

机译:加权序列中的近似匹配

获取原文
获取原文并翻译 | 示例

摘要

Weighted sequences have been recently introduced as a tool to handle a set of sequences that are not identical but have many local similarities. The weighted sequence is a "statistical image" of this set, where the probability of every symbol's occurrence at every text location is given. We address the problem of approximately matching a pattern in such a weighted sequence. The pattern is a given string and we seek all locations in the set where the pattern occurs with a high enough probability. We define the notion of Hamming distance and edit distance in weighted sequences and give efficient algorithms for computing them. We compute two versions of the Hamming distance in time O(n (m log m)~(1/2)), where n is the length of the weighted text and m is the pattern length. The edit distance is computed in time O(nm) and O(nm~2), depending on the edit distance definition used. Unfortunately, due to space considerations, the edit distance details are left to the journal version. We also define the notion of weighted matching in infinite alphabets and show that exact weighted matching can be computed in time O(s log~2 s), where s is the number of text symbols having non-zero probability. The weighted Hamming distance over infinite alphabets can be computed in time min(O(kn s~(1/2) + s~(3/2) log~2 s), O(s~(4/3)m~(1/3) log s)).
机译:最近引入了加权序列作为处理一组不相同但具有许多局部相似性的序列的工具。加权序列是该集合的“统计图像”,其中给出了每个符号在每个文本位置出现的概率。我们解决了在这种加权序列中近似匹配模式的问题。模式是给定的字符串,我们以足够高的概率查找集合中模式发生的所有位置。我们定义了汉明距离的概念并按加权序列编辑距离,并给出了有效的算法来计算它们。我们在时间O(n(m log m)〜(1/2))中计算汉明距离的两个版本,其中n是加权文本的长度,m是图案长度。根据使用的编辑距离定义,以时间O(nm)和O(nm〜2)计算编辑距离。不幸的是,出于空间考虑,编辑距离的详细信息留给日志版本。我们还定义了无限字母中的加权匹配的概念,并表明可以在时间O(s log〜2 s)中计算出精确的加权匹配,其中s是具有非零概率的文本符号的数量。无限字母上的加权汉明距离可以在时间min(O(kn s〜(1/2)+ s〜(3/2)log〜2 s),O(s〜(4/3)m〜( 1/3)log s))。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号