Search engines process millions of user queries every day. The response to a query typically takes the form of a results page, whose construction involves identifying, via an index, a set of documents that match the query and, for each document, constructing a query-biased summary sourced from the document's text. For a search system to achieve high throughput, both of these tasks must be processed efficiently. Most published work that aims to improve search efficiency focuses on optimising the inverted index. Techniques such as compression, pruning, caching, and reordering of inverted indexes have been shown to substantially speed up query evaluation. To date, however, no published literature examines the efficient generation of query-biased summaries. In this thesis we propose a compression-based scheme for representing documents that supports efficient snippet generation. We demonstrate that, although the proposed scheme achieves slightly inferior compression rates compared to a baseline, it generates snippets 60% faster on average. In addition to compression, we also explore lossy means of compacting documents, or document pruning. Using a document pruning scheme based on sentence reordering, we show that over half the content of a collection can be discarded while still producing snippets of quality comparable to those derived from the full documents. Our experimental results show that, using pruned and then compressed documents as surrogates for the full documents, average snippet generation time is reduced by over 40%. Beyond limiting the amount of data processed, such pruned documents are well suited to caching. Caching these pruned surrogates yields a substantially higher cache hit ratio, and snippet generation throughput increases by 58% compared to using a cache of full documents.
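To make the two ideas above concrete, the sketch below shows a minimal, hypothetical version of sentence-reordering-based pruning and query-biased snippet selection. The function names, the use of sentence length as a significance score, and the naive full-stop sentence splitter are all illustrative assumptions, not the thesis's actual method.

```python
def split_sentences(text):
    # Naive splitter on full stops; illustrative only, not a real tokenizer.
    return [s.strip() for s in text.split('.') if s.strip()]

def prune_document(text, keep_ratio=0.5):
    """Reorder sentences by a static significance score and keep the top
    fraction as the document surrogate. Sentence length stands in for a
    real significance measure here (an assumption for this sketch)."""
    sentences = split_sentences(text)
    ranked = sorted(sentences, key=len, reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]

def snippet(sentences, query, max_sentences=2):
    """Query-biased selection: rank sentences by query-term overlap."""
    q = set(query.lower().split())
    scored = sorted(sentences,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    return ' ... '.join(scored[:max_sentences])

doc = ("Inverted indexes support fast query evaluation. "
       "Snippets summarise each result document. "
       "Caching reduces repeated work. "
       "Compression shrinks storage.")
surrogate = prune_document(doc, keep_ratio=0.5)   # half the sentences survive
print(snippet(surrogate, "document snippets"))
```

Because snippet generation only ever scans the surrogate, discarding the lower-ranked half of each document bounds the work per result without changing the selection logic.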
Finally, we examine whether combining pruning and caching of inverted indexes can yield gains similar to those achieved with pruned document surrogates. While pruning and caching of inverted indexes have been studied in parallel streams of work, little published work examines their combination. Our experimental results on two large datasets show that the index pruning scheme we propose reduces the amount of data processed during query evaluation by over 60%. By caching pruned inverted lists instead of full inverted lists, we demonstrate a gain of 7% in cache hit rate. Together, these new methods substantially reduce the infrastructure required to provide a large-scale search service.
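The intuition behind caching pruned lists can be sketched with a toy byte-budgeted LRU cache: shorter pruned lists let more terms fit in the same budget, so a cycling query stream that thrashes a cache of full lists can be served almost entirely from a cache of pruned lists. The cache class, the 8-bytes-per-posting cost model, and the prefix-truncation pruning are all assumptions for this sketch, not the thesis's design.

```python
from collections import OrderedDict

class PostingsCache:
    """Toy LRU postings cache with a fixed byte budget; illustrative only."""
    def __init__(self, budget):
        self.budget = budget            # total bytes available
        self.used = 0                   # bytes currently occupied
        self.store = OrderedDict()      # term -> postings, in LRU order
        self.hits = self.misses = 0

    def get(self, term, fetch):
        if term in self.store:
            self.hits += 1
            self.store.move_to_end(term)          # mark as recently used
            return self.store[term]
        self.misses += 1
        postings = fetch(term)
        cost = len(postings) * 8                  # assumed 8 B per posting
        while self.store and self.used + cost > self.budget:
            _, old = self.store.popitem(last=False)   # evict LRU entry
            self.used -= len(old) * 8
        if cost <= self.budget:
            self.store[term] = postings
            self.used += cost
        return postings

# Full lists of 100 postings (800 B) vs pruned 40-posting prefixes (320 B).
full_index = {t: list(range(100)) for t in "abc"}
pruned_index = {t: full_index[t][:40] for t in "abc"}

full_cache = PostingsCache(budget=1700)    # fits only two full lists
pruned_cache = PostingsCache(budget=1700)  # fits all three pruned lists
for term in "abcabcabc":                   # cycling query stream
    full_cache.get(term, full_index.__getitem__)
    pruned_cache.get(term, pruned_index.__getitem__)
```

Under this stream the full-list cache thrashes (every access misses), while the pruned-list cache misses only on the first pass over the three terms, illustrating why smaller cached objects raise the hit rate at a fixed memory budget.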