Chinese Journal of Computers (《计算机学报》)

Optimization of Parallel Big Data Cleaning Based on Task Merging

Abstract

Data quality issues can have fatal effects on big data applications, so big data suffering from such issues must be cleaned. The MapReduce programming framework can exploit parallelism to achieve highly scalable cleaning of large data sets. However, for lack of effective design, cleaning processes built on MapReduce contain redundant computation, which degrades performance. The goal of this paper is therefore to optimize the parallel data cleaning process and improve its efficiency. Through our study, we found that data cleaning tasks often run over the same input file or reuse the same intermediate results. Based on this observation, the paper proposes a new optimization technique: task-merging-based optimization. By merging redundant computations, and by merging simple computations that share the same input file, the number of MapReduce rounds, and hence the total running time, can be reduced, achieving system-level optimization. Several complex modules of the data cleaning process are optimized in this way, namely the entity recognition module, the inconsistent data repair module, and the missing value filling module. Experimental results show that the proposed strategies can effectively improve the efficiency of data cleaning.
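The core idea of task merging can be illustrated with a small sketch. Here, two hypothetical cleaning tasks (missing value detection and key extraction for duplicate detection) would normally each require their own scan of the input; merging them into one tagged mapper lets a single MapReduce round serve both. This is a minimal illustration of the technique, not the paper's actual implementation; all function names and the record layout are assumptions.

```python
def map_null_check(record):
    # Task A: emit the keys of records that contain missing fields.
    key, fields = record
    if any(v is None for v in fields):
        yield ("A", key)

def map_dup_key(record):
    # Task B: emit every key, to feed duplicate detection in the reduce phase.
    key, fields = record
    yield ("B", key)

def merged_map(record):
    # Merged mapper: one scan of the input serves both tasks.
    # Tags "A"/"B" keep the two output streams separable downstream,
    # so two MapReduce rounds collapse into one.
    yield from map_null_check(record)
    yield from map_dup_key(record)

records = [(1, ["a", None]), (2, ["b", "c"]), (3, ["d", None])]
out = [kv for r in records for kv in merged_map(r)]
# Tag "A" carries keys 1 and 3 (records with missing values);
# tag "B" carries every key.
```

The same pattern extends to the paper's setting: whenever two jobs share an input file and their map logic is side-effect-free, their mappers can be concatenated and their outputs disambiguated by a tag in the key.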
