Proceedings of the Sixteenth International World Wide Web Conference (WWW2007)

Efficient Search in Large Textual Collections with Redundancy

Abstract

Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages.

In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on a search engine query log and a large collection consisting of multiple crawls.
