Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

机译：基于空间的语法索引，可实现有效的近似字符串搜索

获取原文

页面导航

摘要
著录项
相似文献
相关主题

摘要

Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining ef??cient query processing. We ??rst study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the ??exibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising ??nding is that while we can reduce the index size signi??cantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.

机译：在诸如数据清理，查询松弛和拼写检查之类的应用程序中，回答关于字符串集合的近似查询非常重要，在这些应用程序中，用户查询和数据中都存在不一致和错误。许多现有算法使用基于gram的倒排列表索引结构来回答近似字符串查询。这些索引结构与其原始字符串集合的大小相比“臭名昭著”。在本文中，我们研究如何在保持有效查询处理的同时将这种索引结构的大小减小到给定的空间量。我们首先研究如何采用现有的倒排列表压缩技术来解决我们的问题。然后，我们提出了两种新颖的方法来实现这一目标：一种基于丢弃gram列表，另一种基于组合相关列表。它们既与现有压缩技术正交，又利用了我们设置的独特属性，并为提高查询性能提供了新的机会。对于每种方法，我们分析其对查询性能的影响，并开发算法以明智地选择要丢弃或合并的列表。我们在真实数据集上进行的广泛实验表明，我们的方法为应用程序提供了决定查询性能和索引大小之间折衷方案的灵活性，并且可以胜过现有的压缩技术。一个有趣且令人惊讶的发现是，尽管我们可以在可容忍的性能损失下显着减少索引大小（减少多达60％），但与原始索引相比，减少20-40％甚至可以提高查询性能。

著录项

来源
《Data Engineering, ICDE, 2009 IEEE 25th International Conference on》|2009年|P.604-615|共12页
会议地点
作者
Behm; Alexander; Ji; Shengyue; Li; Chen; Lu; Jiaheng;
展开▼
作者单位

展开▼
会议组织
原文格式 PDF
正文语种
中图分类工业技术;
关键词
Approximate String Search; Compression; Grams; Inverted Lists;

机译：近似字符串搜索;压缩;语法;倒排列表;

相似文献

外文文献
中文文献
专利

1. A Gram-Based String Paradigm for Efficient Video Subsequence Search [J] . Huang Z., Liu J., Cui B., Multimedia, IEEE Transactions on . 2013,第3期

机译：用于有效视频子序列搜索的基于Gram的字符串范例
2. HDIdx: High-dimensional indexing for efficient approximate nearest neighbor search [J] . Wan Ji, Tang Sheng, Zhang Yongdong, Neurocomputing . 2017,第MAY10期

机译：HDIdx：高维索引，可进行有效的近似最近邻居搜索
3. Space-Efficient String Indexing for Wildcard Pattern Matching [J] . Moshe Lewenstein, Yakov Nekrich, Jeffrey Scott Vitter LIPIcs : Leibniz International Proceedings in Informatics . 2014,第2期

机译：空间有效的字符串索引，用于通配符模式匹配
4. Space-Constrained Gram-Based Indexing for Efficient Approximate String Search [C] . Alexander Behm, Shengyue Ji, Chen Li, IEEE International Conference on Data Engineering . 2009

机译：基于空间受限的克索引，用于有效近似字符串搜索
5. Spatial approximate string search. [D] . Yao, Bin. 2011

机译：空间近似字符串搜索。
6. Efficient privacy-preserving string search and an application in genomics [O] . Kana Shimizu, Koji Nuida, Gunnar Rätsch -1

机译：高效的隐私保护字符串搜索及其在基因组学中的应用
7. Space-constrained gram-based indexing for efficient approximate string search [O] . Alexander Behm, Shengyue Ji, Chen Li, 2008

机译：基于空间限制的基于gram的索引，可进行有效的近似字符串搜索

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

摘要

著录项

相似文献

相关主题

期刊订阅