IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Massive Data Load on Distributed Database Systems over HBase

Abstract

Big Data has become a pervasive technology to manage the ever-increasing volumes of data. Among Big Data solutions, scalable data stores play an important role, especially key-value data stores, due to their high scalability (thousands of nodes). The typical workflow for Big Data applications includes two phases. The first is to load the data into the data store, typically as part of an ETL (Extract-Transform-Load) process. The second is the processing of the data itself. BigTable and HBase are the preferred key-value solutions based on range-partitioned data stores. However, the loading phase is inefficient and creates a single-node bottleneck. In this paper, we identify and quantify this bottleneck and propose a tool for parallel massive data loading that removes it, exploiting the full parallelism and throughput of the underlying key-value data store during the loading phase as well. The proposed solution has been implemented as a tool for parallel massive data loading over HBase, the key-value data store of the Hadoop ecosystem.
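The bottleneck the abstract refers to comes from funneling all writes of the loading phase through a single client or region. For illustration only, the following is a minimal sketch of the standard HBase MapReduce bulk-load path (HFileOutputFormat2 plus LoadIncrementalHFiles), which writes region-aligned HFiles in parallel and then hands them to the RegionServers; it is not the paper's tool, and the table name massive_table, the two-field CSV input, and the column family cf are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
// HBase 1.x package; in HBase 2.x this class lives in org.apache.hadoop.hbase.tool
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  // Mapper: parse each CSV line (rowkey,value) into a Put keyed by its row key,
  // so HFileOutputFormat2 can sort and partition the output per region.
  public static class CsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      byte[] rowKey = Bytes.toBytes(fields[0]);
      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
          Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("massive_table"); // hypothetical table
    Path input = new Path(args[0]);     // raw CSV input on HDFS
    Path hfileDir = new Path(args[1]);  // staging directory for generated HFiles

    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      Job job = Job.getInstance(conf, "parallel-bulk-load");
      job.setJarByClass(BulkLoadSketch.class);
      job.setMapperClass(CsvToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      job.setInputFormatClass(TextInputFormat.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, hfileDir);

      // Configure the job to write region-aligned HFiles, one reducer per region,
      // so every RegionServer receives its share of the data in parallel.
      HFileOutputFormat2.configureIncrementalLoad(job,
          conn.getTable(tableName), conn.getRegionLocator(tableName));

      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }

      // Hand the finished HFiles over to the RegionServers; the data never
      // passes through a single client write path.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(hfileDir, conn.getAdmin(),
          conn.getTable(tableName), conn.getRegionLocator(tableName));
    }
  }
}
```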
