Journal: Information Processing & Management

MapReduce indexing strategies: Studying scalability and efficiency

Abstract

In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for I/O-intensive tasks like indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
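
The abstract names a per-posting list strategy but does not describe its mechanics. The following is a minimal, single-process sketch of one plausible reading of that idea, not the paper's implementation: each map task buffers partial posting lists in memory and emits them per term when it flushes, so fewer and larger records cross the network than if every token were emitted separately. The function names (`map_split`, `reduce_term`, `build_index`), the `flush_every` parameter, and whitespace tokenisation are all assumptions made for illustration. No Hadoop API is used.

```python
from collections import Counter, defaultdict


def map_split(docs, flush_every=2):
    """Hypothetical map task: buffer partial posting lists, emit them on flush.

    docs is a list of (doc_id, text) pairs. Returns a list of
    (term, [(doc_id, term_frequency), ...]) records.
    """
    buffer = defaultdict(list)
    emitted = []
    for n, (doc_id, text) in enumerate(docs, start=1):
        # Count term frequencies within this document.
        for term, tf in Counter(text.lower().split()).items():
            buffer[term].append((doc_id, tf))
        # Simulate a memory-triggered flush every `flush_every` documents.
        if n % flush_every == 0:
            emitted.extend(buffer.items())
            buffer = defaultdict(list)
    # Emit whatever is left at the end of the split.
    emitted.extend(buffer.items())
    return emitted


def reduce_term(term, partial_lists):
    """Hypothetical reduce task: merge partial posting lists for one term."""
    merged = sorted(p for plist in partial_lists for p in plist)
    return term, merged


def build_index(splits):
    """Simulate the shuffle phase in memory and run the reducers."""
    shuffled = defaultdict(list)
    for docs in splits:
        for term, plist in map_split(docs):
            shuffled[term].append(plist)
    return dict(reduce_term(t, pls) for t, pls in sorted(shuffled.items()))


if __name__ == "__main__":
    splits = [
        [(1, "map reduce map"), (2, "reduce index")],
        [(3, "index map")],
    ]
    for term, postings in build_index(splits).items():
        print(term, postings)
```

In this toy setup, the number of records emitted per map task depends on how often the buffer is flushed rather than on the number of tokens, which is the kind of data-transfer reduction the abstract highlights.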