首页> 外文期刊>Theoretical computer science >Average-case linear-time similar substring searching by the q-gram distance
【24h】

Average-case linear-time similar substring searching by the q-gram distance

机译:通过q-gram距离搜索平均情况的线性时间相似子字符串

获取原文
获取原文并翻译 | 示例
       

摘要

In this paper we consider the problem of similar substring searching in the q-gram distance. The q-gram distance d_q(x, y) is a similarity measure between two strings x and y defined by the number of different q-grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x| + |y|) vs. O(|x‖y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When "similar" means k or less in dq, their time complexities are O(|t|k + |p|) and O(|t|logk + |p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t| + |p|), although their worst-case complexities are still O(|t|k + |p|) and O(|t|logk + |p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(logk) time. Experimental results support their theoretical average-case complexities.
机译:在本文中,我们考虑在q-gram距离内相似子字符串搜索的问题。 q-gram距离d_q(x,y)是两个字符串x和y之间的相似度,由两个字符串之间的不同q-gram的数量定义。由于该距离的计算成本较低(O(| x | + | y |)vs. O(|x‖y|),并且可以很好地近似于编辑距离,因此可以使用该距离代替编辑距离。但是,如果将此距离应用于在长文本t中找到给定模式p的所有相似字符串的问题,则总的计算成本有时是不可接受的。 Ukkonen已经提出了两种快速算法:一种具有数组,另一种具有树。当“相似”在dq中表示k或更小时,它们的时间复杂度分别为O(| t | k + | p |)和O(| t | logk + | p |)。在本文中,我们提出了两种平均情况下复杂度为O(| t | + | p |)的算法,尽管它们的最坏情况下复杂度仍为O(| t | k + | p |)和O(| t | logk + | p |)。在对t进行随机采样并假设q大于阈值的条件下,分析了平均情况下复杂度的线性。该算法利用了以下事实:如果子串的起始位置很近,通常会在非常接近的位置找到t中相似的子串。在第二个提出的算法中,我们采用了由数组和搜索树支持的双链表,以在O(logk)时间中搜索列表元素。实验结果支持了他们的理论平均情况下的复杂性。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号