首页> 外文会议>ACM SIGMOD international conference on management of data >B~(ed)-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance
【24h】

B~(ed)-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance

机译:b〜(ed) - 基于编辑距离的字符串相似性搜索的通用索引结构

获取原文

摘要

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the B~(ed)-tree, a B~+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the B~(ed)-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
机译:串在计算机系统中无处不在,因此字符串处理吸引了来自不同地区计算机科学家的广泛研究工作。字符串处理中最重要的问题是基于指定的相似度测量有效地评估两个字符串之间的相似性。字符串相似性搜索是信息检索,数据库清洁,生物序列分析等的基本问题。虽然已经提出了大量对字符串的不相似措施,但编辑距离是广泛应用中最受欢迎的选择。基于编辑距离的相似性搜索查询的现有索引技术,例如,近似选择和加入查询,主要依赖于与反相列表结构耦合的N-GRAM签名。这些技术仅针对特定查询类型量身定制,它们的性能尤其在具有严格的内存约束或频繁数据更新的情况下仍然不令人满意。在本文中,我们提出了B〜(ED)-Tree,基于B + -Tree的索引结构,用于评估编辑距离和归一化编辑距离上的所有类型的相似性查询。我们确定从字符串空间到整数空间的映射的必要属性,以支持这些查询的搜索和修剪。提出了三种转换,捕获字符串中固有的信息的不同方面,在树上的搜索过程中实现有效修剪。与字符串相似性搜索的最先进方法相比,B〜(ED)-Tree是满足所有应用要求的完整解决方案,提供高可扩展性和快速响应时间。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号