Journal of Computer Applications (计算机应用)

A MapReduce-Based Programming Model for Importing Big Tables into Hadoop

Abstract

To address the instability and low efficiency that Sqoop exhibits when importing big tables from a relational database into the Hadoop Distributed File System (HDFS), a new programming model based on the MapReduce framework was designed and implemented. The model's table-splitting algorithm works as follows: the total number of records in the table is divided by the number of mappers to obtain a step length, and the SQL query for each split is then constructed from a starting row index and a span equal to that step, so that every mapper performs exactly the same amount of import work. In the map phase, each mapper receives a single key-value pair whose key is the SQL statement corresponding to its split (the value is null), and the query is executed inside the map function, so each mapper calls the map function only once. Comparison experiments show that two big tables with the same number of records take essentially the same time to import regardless of how their records are distributed, and that importing the same table with different splitting fields also takes the same time; for a given big table, the model imports data significantly faster than Sqoop.
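The splitting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the table name, the MySQL-style `LIMIT offset, span` paging clause, and the handling of the final remainder split are all assumptions introduced here for clarity.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the split algorithm from the abstract: the table's total record
// count is divided by the mapper count to get a step length, and each split
// becomes one SQL statement covering [start, start + span) rows.
public class BigTableSplitter {

    // Build one SQL query per mapper. The paper makes every split exactly one
    // step long; here the last split absorbs any remainder rows (an assumption,
    // so that every row is covered when totalRows is not divisible evenly).
    public static List<String> buildSplitQueries(String table, long totalRows, int numMappers) {
        long step = totalRows / numMappers;          // rows per mapper (the step length)
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < numMappers; i++) {
            long start = i * step;                   // starting row index of this split
            long span = (i == numMappers - 1) ? totalRows - start : step;
            // MySQL-style paging; other databases would need a different clause.
            queries.add("SELECT * FROM " + table + " LIMIT " + start + ", " + span);
        }
        return queries;
    }
}
```

Each generated statement would then be handed to one mapper as the key of its single key-value pair, so the mapper runs its query once inside the map function and writes the rows to HDFS.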
