首页> 外文OA文献 >Efficient indexing for big data in Hadoop MapReduce and main memory databases
【2h】

Efficient indexing for big data in Hadoop MapReduce and main memory databases

机译:Hadoop MapReduce和主内存数据库中大数据的高效索引

代理获取
本网站仅为用户提供外文OA文献查询和代理获取服务,本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文,但由于OA文献来源多样且变更频繁,仍可能出现获取不到、文献不完整或与标题不符等情况,如果获取不到我们将提供退款服务。请知悉。

摘要

In recent years, the database community has witnessed the advent and breakthrough of many new system designs like the famous Hadoop MapReduce or main memory databases like MonetDB, Hekaton, SAP Hana, and HyPer to solve the problems of “Big Data”. The system architectures in this generation of emerging systems often radically differ from traditional relational databases. For database research, this trend creates new challenges to design and optimize data structures for those novel architectures. Premier candidates for innovations are index structures, which are traditionally among the most crucial performance factors in databases. In this thesis, we focus on efficient indexing methods for Hadoop MapReduce and main memory databases. Our work consists of three independent parts that resulted from different research projects.In the first part, we introduce HAIL, a novel approach for efficient static and adaptive indexing in Hadoop MapReduce. We believe that efficient indexing becomes increasingly important in the context of very large data sets. HAIL combines very low index build times that are often even invisible for users, with significant runtime improvements for selective MapReduce jobs. We provide extensive experiments, and show that HAIL can improve job runtimes by up to 68x over Hadoop.In the second part of this thesis, we present an in-depth evaluation of the adaptive radix tree ART, a recent and very promising competitor in the domain of tree-indexes for main memory databases. ART was reported by its inventors to be significantly faster than previous tree indexes and even competitive to hash tables. However, the original evaluation of ART did not consider Judy Arrays, which is, to the best of our knowledge, the first data structure introducing adaptivity to radix trees. Furthermore, the hash table used in the comparison with ART was just a textbook implementation of chained hashing and not a more sophisticated state-of-the-art hash tables. We provide an extended analysis and experimental evaluation of ART, including a detailed comparison to Judy Arrays, hashing via quadratic probing, and three variants of Cuckoo hashing. Our results give a more differentiated look on ART. In particular, we present striking conceptual similarities between ART and Judy Arrays and show that well-engineered hash tables can beat the lookup throughput of adaptive radix trees by up 6x.In the third part, motivated by our previous results, we take a closer look at hashing methods in main memory databases. We identify seven key factors that influence hashing performance, evaluate their impact, and discuss the implications on hashing in modern databases. Our study indicates that choosing the right hashing method and configuration can make an order of magnitude difference in insert and lookup performance. We also provide a guideline for practitioners on when to use which hashing method.
机译:近年来,数据库社区见证了许多新系统设计的出现和突破,例如著名的Hadoop MapReduce或主内存数据库(例如MonetDB,Hekaton,SAP Hana和HyPer)来解决“大数据”问题。这一代新兴系统中的系统架构通常与传统的关系数据库完全不同。对于数据库研究而言,这种趋势给设计和优化这些新颖架构的数据结构带来了新的挑战。创新的主要候选者是索引结构,它通常是数据库中最关键的性能因素之一。在本文中,我们重点介绍了针对Hadoop MapReduce和主内存数据库的有效索引方法。我们的工作由不同研究项目得出的三个独立部分组成。在第一部分中,我们介绍了HAIL,这是一种用于在Hadoop MapReduce中高效进行静态和自适应索引的新颖方法。我们认为,在非常大的数据集中,有效的索引编制变得越来越重要。 HAIL结合了非常低的索引构建时间(对于用户来说甚至通常是看不见的)以及针对选择性MapReduce作业的显着运行时改进。我们提供了广泛的实验,并证明HAIL可以比Hadoop将作业运行时间提高68倍。在本文的第二部分,我们对自适应基数树ART进行了深入的评估,自适应基数树ART是Hadoop中最有希望的竞争对手主内存数据库的树索引域。据其发明者报道,ART比以前的树索引快得多,甚至比哈希表更具竞争力。但是,ART的原始评估并未考虑Judy Arrays,据我们所知,Judy Arrays是第一个将适应性引入基数树的数据结构。此外,与ART进行比较时使用的哈希表只是链式哈希的教科书实现,而不是更复杂的最新哈希表。我们提供了ART的扩展分析和实验评估,包括与Judy Arrays的详细比较,通过二次探测进行的哈希处理以及Cuckoo哈希处理的三种变体。我们的结果使ART的外观更具差异性。特别是,我们提出了ART与Judy Arrays之间惊人的概念相似性,并表明精心设计的哈希表可以将自适应基数树的查找吞吐量提高6倍。第三部分,基于我们之前的结果,我们仔细研究一下主内存数据库中的哈希方法。我们确定影响散列性能的七个关键因素,评估它们的影响,并讨论对现代数据库中散列的影响。我们的研究表明,选择正确的哈希方法和配置可以使插入和查找性能产生一个数量级的差异。我们还为从业人员提供何时使用哪种哈希方法的指南。

著录项

  • 作者

    Richter Stefan;

  • 作者单位
  • 年度 2015
  • 总页数
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类

相似文献

  • 外文文献
  • 中文文献
  • 专利

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号