
Document representation for efficient search engines


Abstract

Search engines process millions of user queries on a daily basis. The response to a query typically takes the form of a results page, constructed by identifying, via an index, a set of documents that match the query and then, for each document, building a query-biased summary from the document's text. For a search system to achieve high throughput, efficient processing of both tasks is paramount. Most published work on improving search efficiency focuses on optimising the inverted index: techniques such as compression, pruning, caching, and reordering of inverted indexes have been shown to substantially speed up query evaluation. To date, however, no published literature examines the efficient generation of query-biased summaries. In this thesis we propose a compression-based scheme for representing documents that allows efficient snippet generation. We demonstrate that, although the proposed system yields slightly lower compression rates than a baseline, it is on average 60% faster at generating snippets. Beyond compression, we also explore lossy compaction of documents, or document pruning. Using a document pruning scheme based on sentence reordering, we show that over half the content of a collection can be discarded while still producing snippets of quality comparable to those derived from the full documents. Our experimental results show that using pruned and then compressed documents as surrogates for the full documents reduces average snippet generation time by over 40%. Besides limiting the amount of data processed, such pruned documents are candidates for caching. By caching these pruned surrogates, we show that a substantially higher cache hit ratio can be achieved; moreover, snippet generation throughput increases by 58% compared to using a cache of full documents.
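To make the two ideas above concrete, the following is a minimal sketch, not the thesis's actual algorithms: document pruning that reorders sentences by a crude significance heuristic (here, the number of distinct words, an illustrative assumption) and keeps a prefix, and query-biased snippet generation that scores the surviving sentences by query-term overlap.

```python
def prune_document(sentences, keep_ratio=0.5):
    """Reorder sentences by a crude significance score and keep a prefix.

    The scoring heuristic (distinct-word count) is an illustrative
    assumption, not the scheme proposed in the thesis.
    """
    def significance(sentence):
        return len(set(sentence.lower().split()))
    ranked = sorted(sentences, key=significance, reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


def snippet(sentences, query, max_sentences=2):
    """Build a query-biased snippet: keep the sentences that share the
    most terms with the query, joined with ellipses."""
    terms = set(query.lower().split())
    def overlap(sentence):
        return len(terms & set(sentence.lower().split()))
    best = sorted(sentences, key=overlap, reverse=True)[:max_sentences]
    return " ... ".join(best)
```

With half the sentences discarded, snippets are generated from a much smaller surrogate, which is the source of the speed-up the abstract reports.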
Finally, we examine whether combining pruning and caching of inverted indexes can yield gains similar to those obtained with the pruned document surrogates. While pruning and caching of inverted indexes have been studied in parallel streams of work, little published work examines their combination. Our experimental results on two large datasets show that the index pruning scheme we propose reduces the amount of data processed during query evaluation by over 60%. By caching pruned inverted lists instead of full inverted lists, we demonstrate a 7% gain in cache hit rate. Together, these new methods substantially reduce the infrastructure required to provide a large-scale search service.
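The caching argument can be sketched as follows, again as an assumption-laden illustration rather than the thesis's design: an LRU cache with a fixed posting budget that stores only the top-k prefix of each inverted list. Because pruned lists are shorter, more terms fit in the same budget, which is why the hit rate rises.

```python
from collections import OrderedDict

class PrunedPostingCache:
    """LRU cache keyed by term, storing only a pruned prefix of each
    inverted list. Capacity is measured in postings, so shorter pruned
    lists let more terms be cached at once (illustrative sketch)."""

    def __init__(self, budget):
        self.budget = budget          # capacity in number of postings
        self.used = 0
        self.cache = OrderedDict()    # term -> pruned posting list

    def get(self, term):
        if term in self.cache:
            self.cache.move_to_end(term)   # mark as most recently used
            return self.cache[term]
        return None                        # cache miss

    def put(self, term, postings, top_k=10):
        pruned = postings[:top_k]          # keep only the list prefix
        # Evict least-recently-used entries until the pruned list fits.
        while self.used + len(pruned) > self.budget and self.cache:
            _, evicted = self.cache.popitem(last=False)
            self.used -= len(evicted)
        if self.used + len(pruned) <= self.budget:
            self.cache[term] = pruned
            self.used += len(pruned)
```

Here the prefix is taken by list position; a real impact-ordered index would keep the top-k highest-impact postings, but the capacity effect is the same.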

Record details

  • Author: Tsegay Y
  • Year: 2009
  • Format: PDF

