Efficiently Supporting Edit Distance Based String Similarity Search Using B<formula formulatype='inline'><tex Notation='TeX'> $^+$</tex><mathgraphic graphicformat='GIF' fileref='lu-ieq1-2309131.gif'/></formula>-Trees

Lu W.; Du X.; Hadjieleftheriou M.; Ooi B.C.

首页> 外文期刊>Knowledge and Data Engineering, IEEE Transactions on >Efficiently Supporting Edit Distance Based String Similarity Search Using B $^+$-Trees

【24h】

Efficiently Supporting Edit Distance Based String Similarity Search Using B $^+$-Trees

机译：使用B $ ^ + $ -树

获取原文

获取原文并翻译 | 示例

开具论文收录证明 >>

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B-tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space in that edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B-tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.

机译：编辑距离被广泛用于测量两个字符串之间的相似度。作为基本操作，基于编辑距离的字符串相似度搜索是使用编辑距离在集合中查找与给定查询字符串相似的字符串。回答此类字符串相似性查询的现有方法通过使用各种索引来遵循过滤验证框架。通常，大多数方法都假定索引和数据集保留在主内存中。为了克服这一限制，在本文中，我们提出了一种基于B树的方法来回答基于编辑距离的字符串相似性查询，因此，我们的方法可以轻松地集成到现有的RDBMS中。通常，我们使用在度量空间中采用的修剪技术来回答字符串相似性搜索，因为编辑距离是一个度量。首先，我们根据一组参考字符串将字符串集合划分为多个分区。然后，我们根据这些字符串到其相应参考字符串的距离，使用单个B树在所有分区中为这些字符串编制索引。最后，我们提出了两种基于B树的有效回答范围查询和KNN查询的方法。我们证明数据集的最佳划分是一个NP难题，因此提出了一种启发式方法，用于贪婪地选择参考字符串，并提出了一种最佳划分分配策略，以最大程度地减少在验证过程中需要验证的预期字符串数。查询评估。通过在各种真实数据集上进行的广泛实验，我们证明了在大多数情况下，基于B树的方法在范围和KNN查询上均提供优于最新技术的性能。

著录项

来源
《Knowledge and Data Engineering, IEEE Transactions on》 |2014年第12期|2983-2996|共14页
作者
Lu W.; Du X.; Hadjieleftheriou M.; Ooi B.C.;
展开▼
作者单位

School of Computing, National University of Singapore, Singapore;

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类
关键词
Educational institutions; Electronic mail; Indexing; NP-hard problem; Partitioning algorithms; Bformula formulatype="inline"tex Notation="TeX"$^+$/texmathgraphic graphicformat="GIF" fileref="lu-ieq6-2309131.gif"/ /formula-tree; Similarity search; edit distance; string;

机译：教育机构;电子邮件;索引;NP难题;分区算法;B 公式Formulatype =“ inline” tex注释=“ TeX” $ ^ + $ / tex 数学图形格式=“ GIF” fileref = “ lu-ieq6-2309131.gif” / / formula -tree;相似搜索;编辑距离;字符串;

相似文献

外文文献
中文文献
专利

1. Fault-Free Hamiltonian Cycles Passing through Prescribed Edges in src="/images/tex/348.gif" alt="k"> -Ary src="/images/tex/21855.gif" alt="n"> -Cubes with Faulty Edges [J] . Zhang S., Zhang X. Parallel and Distributed Systems, IEEE Transactions on . 2015,第2期

机译：无故障的哈密顿循环经过中的指定边沿 src =“ / images / tex / 348.gif” alt =“ k”> -Ary src =” / images / tex / 21855.gif“ alt =” n“> -具有错误边缘的多维数据集
2. $schmi{GF(2^n)}$ Shifted Polynomial Basis Multipliers Based on Subquadratic Toeplitz Matrix-Vector Product Approach for All Irreducible Pentanomials [J] . Han J., Fan H. Computers, IEEE Transactions on . 2015,第3期

机译： $ schmi {GF（2 ^ n）} $ 基于二次Toeplitz矩阵-向量乘积法的所有不可约五项式的移位多项式基乘
3. Distributed $k$ -Core View Materializationand Maintenance for Large Dynamic Graphs [J] . Aksu H., Canim M., Chang Y.-C., Knowledge and Data Engineering, IEEE Transactions on . 2014,第10期

机译：分布式 $ k $ -Core查看大型动态图的实现和维护
4. B~(ed)-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance [C] . Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, ACM SIGMOD international conference on management of data;SIGMOD 2010 . 2010

机译：B〜（ed）-Tree：基于编辑距离的字符串相似度搜索的通用索引结构
5. A 265 V$_{m RMS}$ Mains Interface Integrated in 0.35 $mu$m CMOS [O] . H. Meyvaert, P. Smeets, M. Steyaert 2013

机译：A 265 V <公式甲型键=“内联”> $ _ { RM RMS} $ MAINS接口集成在0.35 <公式甲型型=“内联”> $ mu $ m cmos

Efficiently Supporting Edit Distance Based String Similarity Search Using B $^+$-Trees

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅