Journal: ACM Transactions on the Web

A Term-Based Inverted Index Partitioning Model for Efficient Distributed Query Processing

Abstract

In a shared-nothing, distributed text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the index is partitioned either by document or by term. The choice depends on the properties of the underlying hardware infrastructure, the query traffic distribution, and certain performance and availability constraints. In retrieval systems that adopt term-based index partitioning, the high communication overhead caused by transferring large amounts of data from the index servers forms a major performance bottleneck and degrades the scalability of the entire distributed retrieval system. In this work, to alleviate this problem, we propose a novel inverted index partitioning model that relies on hypergraph partitioning. In the proposed model, concurrently accessed index entries are assigned to the same index servers, based on the inverted index access patterns extracted from past query logs. The model aims to minimize the communication overhead that will be incurred by future queries while maintaining the computational load balance among the index servers. We evaluate the performance of the proposed model through extensive experiments using a real-life text collection and a search query sample. Our results show that considerable performance gains can be achieved relative to the term-based index partitioning strategies previously proposed in the literature. In most cases, however, the performance remains inferior to that attained by document-based partitioning.
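To make the idea in the abstract concrete, the following is a minimal Python sketch, not the authors' model. It treats each logged query as a net (hyperedge) over the terms it accesses and then assigns terms to servers with a simple greedy heuristic, which stands in here for a real hypergraph partitioner. The function names, the `term_load` weights (a stand-in for per-term processing cost), the `imbalance` tolerance, and the cost measure (distinct servers touched per query, minus one) are all assumptions made for illustration.

```python
from collections import defaultdict


def partition_terms(queries, term_load, num_servers, imbalance=0.10):
    """Greedily assign terms to servers so that co-queried terms stay together.

    queries     : list of term lists taken from a past query log (one net each)
    term_load   : dict term -> estimated processing load (hypothetical weights)
    num_servers : number of index servers
    imbalance   : allowed relative excess over the average server load
    """
    # Map every term to the queries (nets) that access it.
    nets_of = defaultdict(list)
    for qid, terms in enumerate(queries):
        for term in set(terms):
            nets_of[term].append(qid)

    cap = (1.0 + imbalance) * sum(term_load.values()) / num_servers
    server_load = [0.0] * num_servers
    # placed[qid][s] = number of terms of query qid already stored on server s
    placed = defaultdict(lambda: [0] * num_servers)
    assignment = {}

    # Place heavy terms first; ties are broken by name for determinism.
    for term in sorted(term_load, key=lambda t: (-term_load[t], t)):
        load = term_load[term]
        # Affinity of a server = how many co-queried terms it already holds.
        affinity = [0] * num_servers
        for qid in nets_of[term]:
            for s in range(num_servers):
                affinity[s] += placed[qid][s]
        # Only servers that stay under the load cap are candidates;
        # fall back to the least-loaded server if none qualifies.
        feasible = [s for s in range(num_servers) if server_load[s] + load <= cap]
        if not feasible:
            feasible = [min(range(num_servers), key=lambda s: server_load[s])]
        best = max(feasible, key=lambda s: (affinity[s], -server_load[s], -s))
        assignment[term] = best
        server_load[best] += load
        for qid in nets_of[term]:
            placed[qid][best] += 1
    return assignment, server_load


def communication_cost(queries, assignment):
    """Sum over queries of (number of distinct servers touched - 1)."""
    return sum(len({assignment[t] for t in q}) - 1 for q in queries if q)


if __name__ == "__main__":
    log = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"], ["a", "c"]]
    loads = {"a": 4, "b": 3, "c": 4, "d": 3}
    assignment, server_load = partition_terms(log, loads, num_servers=2)
    print(assignment, server_load, communication_cost(log, assignment))
```

A query whose terms all sit on one server contributes zero to this cost, so lower totals mean fewer servers need to exchange data per query. A dedicated hypergraph partitioning tool would typically find better assignments than this greedy pass; the sketch only illustrates the trade-off between co-locating co-accessed terms and keeping server loads balanced.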
