Search engines process millions of user queries every day. The response to a query typically takes the form of a results page, whose construction involves identifying, via an index, a set of documents that match the query and, for each document, constructing a query-biased summary sourced from the document's text. For a search system to achieve high throughput, both of these tasks must be processed efficiently. Most published work that aims to improve search efficiency focuses on optimising the inverted index. Techniques such as compression, pruning, caching, and reordering of inverted indexes have been shown to substantially speed up query evaluation. To date, however, no published literature examines the efficient generation of query-biased summaries. In this thesis we propose a compression-based scheme for representing documents that supports efficient snippet generation. We demonstrate that, although the proposed scheme achieves slightly inferior compression rates compared to a baseline, it generates snippets 60% faster on average. In addition to compression, we also explore lossy means of compacting documents, or document pruning. Using a document pruning scheme based on sentence reordering, we show that over half the content of a collection can be discarded while still producing snippets of quality comparable to those derived from the full documents. Our experimental results show that, using pruned and then compressed documents as surrogates for the full documents, average snippet generation time is reduced by over 40%. Beyond limiting the amount of data processed, such pruned documents are well suited to caching. Caching these pruned surrogates yields a substantially higher cache hit ratio, and snippet generation throughput increases by 58% compared to using a cache of full documents.
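To make the two ideas above concrete, the sketch below shows a minimal, hypothetical version of sentence-reordering-based pruning and query-biased snippet selection. The function names, the use of sentence length as a significance score, and the naive full-stop sentence splitter are all illustrative assumptions, not the thesis's actual method.

```python
def split_sentences(text):
    # Naive splitter on full stops; illustrative only, not a real tokenizer.
    return [s.strip() for s in text.split('.') if s.strip()]

def prune_document(text, keep_ratio=0.5):
    """Reorder sentences by a static significance score and keep the top
    fraction as the document surrogate. Sentence length stands in for a
    real significance measure here (an assumption for this sketch)."""
    sentences = split_sentences(text)
    ranked = sorted(sentences, key=len, reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]

def snippet(sentences, query, max_sentences=2):
    """Query-biased selection: rank sentences by query-term overlap."""
    q = set(query.lower().split())
    scored = sorted(sentences,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    return ' ... '.join(scored[:max_sentences])

doc = ("Inverted indexes support fast query evaluation. "
       "Snippets summarise each result document. "
       "Caching reduces repeated work. "
       "Compression shrinks storage.")
surrogate = prune_document(doc, keep_ratio=0.5)   # half the sentences survive
print(snippet(surrogate, "document snippets"))
```

Because snippet generation only ever scans the surrogate, discarding the lower-ranked half of each document bounds the work per result without changing the selection logic.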
Finally, we examine whether combining pruning and caching of inverted indexes can yield gains similar to those achieved with pruned document surrogates. While pruning and caching of inverted indexes have been studied in parallel streams of work, little published work examines their combination. Our experimental results on two large datasets show that the index pruning scheme we propose reduces the amount of data processed during query evaluation by over 60%. By caching pruned inverted lists instead of full inverted lists, we demonstrate a gain of 7% in cache hit rate. Together, these new methods substantially reduce the infrastructure required to provide a large-scale search service.
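The intuition behind caching pruned lists can be sketched with a toy byte-budgeted LRU cache: shorter pruned lists let more terms fit in the same budget, so a cycling query stream that thrashes a cache of full lists can be served almost entirely from a cache of pruned lists. The cache class, the 8-bytes-per-posting cost model, and the prefix-truncation pruning are all assumptions for this sketch, not the thesis's design.

```python
from collections import OrderedDict

class PostingsCache:
    """Toy LRU postings cache with a fixed byte budget; illustrative only."""
    def __init__(self, budget):
        self.budget = budget            # total bytes available
        self.used = 0                   # bytes currently occupied
        self.store = OrderedDict()      # term -> postings, in LRU order
        self.hits = self.misses = 0

    def get(self, term, fetch):
        if term in self.store:
            self.hits += 1
            self.store.move_to_end(term)          # mark as recently used
            return self.store[term]
        self.misses += 1
        postings = fetch(term)
        cost = len(postings) * 8                  # assumed 8 B per posting
        while self.store and self.used + cost > self.budget:
            _, old = self.store.popitem(last=False)   # evict LRU entry
            self.used -= len(old) * 8
        if cost <= self.budget:
            self.store[term] = postings
            self.used += cost
        return postings

# Full lists of 100 postings (800 B) vs pruned 40-posting prefixes (320 B).
full_index = {t: list(range(100)) for t in "abc"}
pruned_index = {t: full_index[t][:40] for t in "abc"}

full_cache = PostingsCache(budget=1700)    # fits only two full lists
pruned_cache = PostingsCache(budget=1700)  # fits all three pruned lists
for term in "abcabcabc":                   # cycling query stream
    full_cache.get(term, full_index.__getitem__)
    pruned_cache.get(term, pruned_index.__getitem__)
```

Under this stream the full-list cache thrashes (every access misses), while the pruned-list cache misses only on the first pass over the three terms, illustrating why smaller cached objects raise the hit rate at a fixed memory budget.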