An Improvement of Data Cleaning Method for Grain Big Data Processing Using Task Merging

Feiyu Lian; Maixia Fu; Xingang Ju

Abstract

Data quality strongly influences the application of grain big data, so data cleaning is a necessary and important task. In the MapReduce framework, parallel techniques are often used to make data cleaning highly scalable, but for lack of effective design the cleaning process contains a large amount of redundant computation, which lowers performance. In this research, we found that during data cleaning some tasks are often carried out multiple times on the same input files, or require the same operation results. To address this problem, we propose a new optimization technique based on task merging. By merging simple or redundant computations on the same input files, the number of iterative computations in MapReduce can be greatly reduced. Experiments show that this significantly reduces the overall system runtime, which confirms that the data cleaning process is optimized. We optimized several data cleaning modules, including entity identification, inconsistent data restoration, and missing value filling. The experimental results show that the proposed method increases the efficiency of grain big data cleaning.
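The idea of task merging can be illustrated with a minimal sketch. It is not the authors' implementation: it simulates a MapReduce job in plain Python, and every name in it (`CountingSource`, `run_job`, `merge_mappers`, the two mappers, and the sample grain records) is a hypothetical chosen for illustration. Two cleaning tasks that read the same input are first run as two separate jobs, then merged into one job whose mapper emits tagged keys for both tasks, so the input is scanned once instead of twice.

```python
from collections import defaultdict


class CountingSource:
    """Hypothetical input wrapper that counts how many full scans are made."""

    def __init__(self, records):
        self.records = records
        self.scans = 0

    def __iter__(self):
        self.scans += 1
        return iter(self.records)


def run_job(source, mapper):
    """Toy MapReduce job: one scan of the input, then group values by key."""
    groups = defaultdict(list)
    for rec in source:
        for key, value in mapper(rec):
            groups[key].append(value)
    return dict(groups)


def missing_mapper(rec):
    # Missing-value task: emit one count per field whose value is missing.
    for field, value in rec.items():
        if value is None:
            yield ("missing", field), 1


def entity_mapper(rec):
    # Entity-identification task: group record ids by a normalized name.
    yield ("entity", rec["name"].strip().lower()), rec["id"]


def merge_mappers(*mappers):
    """Merge several mappers over the same input into a single mapper.

    Keys are already tagged per task ("missing", "entity"), so the merged
    output can be split back into per-task results after the job.
    """
    def merged(rec):
        for m in mappers:
            yield from m(rec)
    return merged


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Wheat ", "moisture": 12.5},
    {"id": 2, "name": "wheat", "moisture": None},
    {"id": 3, "name": "Corn", "moisture": None},
]

# Unmerged: each task is its own job and scans the input again.
src_separate = CountingSource(records)
separate = {}
separate.update(run_job(src_separate, missing_mapper))
separate.update(run_job(src_separate, entity_mapper))

# Merged: both tasks share one job and one scan.
src_merged = CountingSource(records)
merged = run_job(src_merged, merge_mappers(missing_mapper, entity_mapper))

print(src_separate.scans, src_merged.scans)
print(merged == separate)
```

In this toy setting the merged job produces the same grouped output as the two separate jobs while reading the input only once. On a real cluster the saving would also depend on job startup and shuffle costs, which this sketch does not model.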

