Average-case linear-time similar substring searching by the q-gram distance

Hiroyuki Hanada; Mineichi Kudo; Atsuyoshi Nakamura

首页> 外文期刊>Theoretical computer science >Average-case linear-time similar substring searching by the q-gram distance

【24h】

Average-case linear-time similar substring searching by the q-gram distance

机译：通过q-gram距离搜索平均情况的线性时间相似子字符串

获取原文

获取原文并翻译 | 示例

掌桥外文数据库（机构版） >>

开具论文收录证明 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

In this paper we consider the problem of similar substring searching in the q-gram distance. The q-gram distance d_q(x, y) is a similarity measure between two strings x and y defined by the number of different q-grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x| + |y|) vs. O(|x‖y|), and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When "similar" means k or less in dq, their time complexities are O(|t|k + |p|) and O(|t|logk + |p|), respectively. In this paper, we propose two algorithms of average-case complexity O(|t| + |p|), although their worst-case complexities are still O(|t|k + |p|) and O(|t|logk + |p|), respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(logk) time. Experimental results support their theoretical average-case complexities.

机译：在本文中，我们考虑在q-gram距离内相似子字符串搜索的问题。 q-gram距离d_q（x，y）是两个字符串x和y之间的相似度，由两个字符串之间的不同q-gram的数量定义。由于该距离的计算成本较低（O（| x | + | y |）vs. O（|x‖y|），并且可以很好地近似于编辑距离，因此可以使用该距离代替编辑距离。但是，如果将此距离应用于在长文本t中找到给定模式p的所有相似字符串的问题，则总的计算成本有时是不可接受的。 Ukkonen已经提出了两种快速算法：一种具有数组，另一种具有树。当“相似”在dq中表示k或更小时，它们的时间复杂度分别为O（| t | k + | p |）和O（| t | logk + | p |）。在本文中，我们提出了两种平均情况下复杂度为O（| t | + | p |）的算法，尽管它们的最坏情况下复杂度仍为O（| t | k + | p |）和O（| t | logk + | p |）。在对t进行随机采样并假设q大于阈值的条件下，分析了平均情况下复杂度的线性。该算法利用了以下事实：如果子串的起始位置很近，通常会在非常接近的位置找到t中相似的子串。在第二个提出的算法中，我们采用了由数组和搜索树支持的双链表，以在O（logk）时间中搜索列表元素。实验结果支持了他们的理论平均情况下的复杂性。

著录项

来源
《Theoretical computer science》 |2014年第null期|共19页
作者
Hiroyuki Hanada; Mineichi Kudo; Atsuyoshi Nakamura;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类一般性问题;数学;
关键词
String searching; Approximate string matching; Q-Gram distance;

机译：字符串搜索;近似字符串匹配;Q-Gram距离;

相似文献

外文文献
中文文献
专利

1. Average-case linear-time similar substring searching by the q-gram distance [J] . Hiroyuki Hanada, Mineichi Kudo, Atsuyoshi Nakamura Theoretical computer science . 2014,第Null期

机译：通过q-gram距离搜索平均情况的线性时间相似子字符串
2. Towards Distance-Based Phylogenetic Inference in Average-Case Linear-Time [J] . Maxime Crochemore, Alexandre P. Francisco, Solon P. Pissis, LIPIcs : Leibniz International Proceedings in Informatics . 2017,第1期

机译：走向平均情况线性时间中基于距离的系统发育推断
3. Towards Distance-Based Phylogenetic Inference in Average-Case Linear-Time [J] . Maxime Crochemore, Alexandre P. Francisco, Solon P. Pissis, LIPIcs : Leibniz International Proceedings in Informatics . 2017,第29期

机译：走向平均情况线性时间中基于距离的系统发育推断
4. Quasi-linear-time substring searching by q-gram distance [C] . Hanada Hiroyuki, Kudo Mineichi, Nakamura Atsuyoshi 2012 6th International Conference on New Trends in Information Science, Service Science and Data Mining . 2012

机译：通过q-gram距离搜索准线性时间子串
5. Discovering motifs in DNA and protein sequences: The approximate common substring problem. [D] . Bailey, Timothy Lawrence. 1995

机译：在DNA和蛋白质序列中发现基序：近似的常见子串问题。
6. Linear-time protein 3-D structure searching with insertions and deletions [O] . Tetsuo Shibuya, Jesper Jansson, Kunihiko Sadakane 2010

机译：具有插入和缺失的线性蛋白质3D结构搜索
7. Average-case linear-time similar substring searching by the q-gram distance [O] . Hanada, Hiroyuki, Kudo, Mineichi, Nakamura, Atsuyoshi 2014

机译：通过q-gram距离搜索平均情况的线性时间相似子串

Average-case linear-time similar substring searching by the q-gram distance

摘要

著录项

相似文献

相关主题

期刊订阅