B~(ed)-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance

Abstract

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the B~(ed)-tree, a B~+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the B~(ed)-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
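For readers unfamiliar with the measure, the following is a minimal sketch of edit distance and of a brute-force range query built on it. It is not the B~(ed)-tree and does not reproduce any of the transformations proposed in the paper; it only illustrates the kind of query the index is designed to answer. The function names `edit_distance` and `range_query` are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, List


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit (Levenshtein) distance between s and t.

    Uses the standard dynamic program, keeping only one previous row.
    """
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs from s
                curr[j - 1] + 1,     # insert ct into s
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def range_query(strings: Iterable[str], query: str, threshold: int) -> List[str]:
    """Baseline range query: scan all strings, keep those within threshold.

    This linear scan is an assumed baseline for illustration only.
    """
    results = []
    for s in strings:
        # Length filter: the edit distance is never smaller than the
        # difference in lengths, so such candidates can be skipped.
        if abs(len(s) - len(query)) > threshold:
            continue
        if edit_distance(s, query) <= threshold:
            results.append(s)
    return results
```

A call such as `range_query(["kitten", "sitting", "mitten", "kitchen"], "kitten", 1)` scans every candidate. An index like the one described in the abstract aims to avoid that full scan by pruning during tree traversal.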

Bibliographic details

Source: ACM SIGMOD International Conference on Management of Data, 2010, 12 pages
Original format: PDF
Chinese Library Classification: Programming, software engineering
Keywords: string; edit distance; similarity search; range query; top-k query; approximate join

Similar literature

1. Efficiently Supporting Edit Distance Based String Similarity Search Using B$^+$-Trees [J] . Lu W., Du X., Hadjieleftheriou M., IEEE Transactions on Knowledge and Data Engineering . 2014, No. 12

2. A unified framework for string similarity search with edit-distance constraint [J] . Yu Minghe, Wang Jin, Li Guoliang, The VLDB Journal . 2017, No. 2

3. Approximating Tree Edit Distance through String Edit Distance for Binary Tree Codes [J] . Taku Aratsu, Kouichi Hirata, Tetsuji Kuboyama, Fundamenta Informaticae . 2010, No. 3

4. B~(ed)-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance [C] . Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, ACM SIGMOD International Conference on Management of Data; SIGMOD 2010 . 2010

5. String Similarity Joins and Search Under Edit Distance [D] . Zhang, Haoyu. 2020

6. A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures [O] . Daiji Fukagawa, Takeyuki Tamura, Atsuhiro Takasu, 2011

7. Efficiently Supporting Edit Distance based String Similarity Search Using B+-trees [O] . Wei Lu, Xiaoyong Du, Marios Hadjieleftheriou, 2014
