首页> 外文会议>Data Engineering, ICDE, 2009 IEEE 25th International Conference on >Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
【24h】

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

机译:基于空间的语法索引,可实现有效的近似字符串搜索

获取原文

摘要

Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining ef??cient query processing. We ??rst study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the ??exibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising ??nding is that while we can reduce the index size signi??cantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.
机译:在诸如数据清理,查询松弛和拼写检查之类的应用程序中,回答关于字符串集合的近似查询非常重要,在这些应用程序中,用户查询和数据中都存在不一致和错误。许多现有算法使用基于gram的倒排列表索引结构来回答近似字符串查询。这些索引结构与其原始字符串集合的大小相比“臭名昭著”。在本文中,我们研究如何在保持有效查询处理的同时将这种索引结构的大小减小到给定的空间量。我们首先研究如何采用现有的倒排列表压缩技术来解决我们的问题。然后,我们提出了两种新颖的方法来实现这一目标:一种基于丢弃gram列表,另一种基于组合相关列表。它们既与现有压缩技术正交,又利用了我们设置的独特属性,并为提高查询性能提供了新的机会。对于每种方法,我们分析其对查询性能的影响,并开发算法以明智地选择要丢弃或合并的列表。我们在真实数据集上进行的广泛实验表明,我们的方法为应用程序提供了决定查询性能和索引大小之间折衷方案的灵活性,并且可以胜过现有的压缩技术。一个有趣且令人惊讶的发现是,尽管我们可以在可容忍的性能损失下显着减少索引大小(减少多达60%),但与原始索引相比,减少20-40%甚至可以提高查询性能。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号