Efficient Search in Large Textual Collections with Redundancy

Abstract

Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages.

In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on a search engine query log and a large collection consisting of multiple crawls.

Bibliographic details

Source
Proceedings of the Sixteenth International World Wide Web Conference (WWW2007) | 2007 | pp. 411-420 | 10 pages
Conference location: Banff (CA)
Authors
Jiangong Zhang; Torsten Suel
Author affiliations

CIS Department, Polytechnic University, Brooklyn, NY 11201, USA
Original format: PDF
Language: English
Chinese Library Classification: Computer networks
Keywords
search engines; inverted index; redundancy elimination; index compression; query execution;

Similar literature

1. Selective Search: Efficient and Effective Search of Large Textual Collections [J]. Kulkarni Anagha, Callan Jamie. ACM Transactions on Information Systems. 2015, No. 4
2. TSS: Efficient Term Set Search in Large Peer-to-Peer Textual Collections [J]. IEEE Transactions on Computers. 2010, No. 7
3. Efficient Fuzzy Search in Large Text Collections [J]. Hannah Bast, Marjan Celikik. ACM Transactions on Information Systems. 2013, No. 2
4. Efficient Search in Large Textual Collections with Redundancy [C]. Jiangong Zhang, Torsten Suel. International World Wide Web Conference. 2007
5. Index compression and redundancy elimination in large textual collections [D]. Yan, Hao. 2010
6. Correction: MergedTrie: Efficient textual indexing [O]. Antonio Ferrández, Jesús Peral. 2015
7. TSS: Efficient Term Set Search in Large Peer-to-Peer Textual Collections [O]. Chen, Hanhua; Yan, Jun; Jin, Hai. 2010
8. RELIABILITY IMPROVEMENT BY REDUNDANCY IN ELECTRONIC SYSTEMS Ⅱ AND EFFICIENT NEW REDUNDANCY SCHEME-RADIAL LOGIC [R]. Theresa F. Klaschka. 1969
