Parallel computation of information gain using Hadoop and MapReduce

Abstract

Nowadays, companies collect data at an increasingly high rate, to the extent that traditional implementations of algorithms cannot process it in reasonable time. On the other hand, analysis of the available data is key to business success. In a Big Data setting, tasks such as feature selection, finding discretization thresholds for continuous data, and building decision trees are especially difficult. In this paper we discuss how a parallel implementation of the algorithm for computing the information gain can address these issues. Our approach is based on writing Pig Latin scripts that are compiled into MapReduce jobs, which can then be executed on Hadoop clusters. To implement the algorithm, we first define a framework for developing arbitrary algorithms and then apply it to the task at hand. To analyze the impact of the parallelization, we processed the FedCSIS AAIA'14 dataset with the proposed implementation of the information gain. In the experiments we evaluate the speedup of the parallelization compared to a one-node cluster. We also analyze how to optimally determine the number of map and reduce tasks for a given cluster. To demonstrate the portability of the implementation, we present results obtained on an on-premises cluster and on an Amazon AWS cluster. Finally, we illustrate the scalability of the implementation by evaluating it on a replicated version of the same dataset that is 80 times larger than the original.
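
As a rough illustration of the quantity being parallelized, the following minimal Python sketch computes the information gain of one feature from (feature value, class label) counts, arranged as a map step followed by a reduce step. It is not the paper's Pig Latin implementation: the function names, the single-process execution, and the toy records are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def map_phase(records):
    """Emit ((feature_value, class_label), 1) pairs, as a mapper might."""
    for feature_value, label in records:
        yield (feature_value, label), 1


def reduce_phase(pairs):
    """Sum the emitted counts per key, as a reducer might."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return counts


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(joint_counts):
    """IG = H(class) - sum over v of p(v) * H(class | feature = v)."""
    class_totals = Counter()
    value_totals = Counter()
    for (value, label), c in joint_counts.items():
        class_totals[label] += c
        value_totals[value] += c
    n = sum(class_totals.values())
    conditional = 0.0
    for value, n_v in value_totals.items():
        per_class = [c for (v, _), c in joint_counts.items() if v == value]
        conditional += n_v / n * entropy(per_class)
    return entropy(class_totals.values()) - conditional


# Hypothetical toy data: the feature separates the two classes perfectly.
records = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]
gain = information_gain(reduce_phase(map_phase(records)))

# Hypothetical toy data: the feature carries no information about the class.
noise = [("a", 0), ("a", 1), ("b", 0), ("b", 1)]
no_gain = information_gain(reduce_phase(map_phase(noise)))
```

Only the counting step touches the raw records, which is why it maps naturally onto distributed mappers and reducers; the entropy arithmetic then runs on the much smaller table of counts.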

Bibliographic record

Source
Federated Conference on Computer Science and Information Systems | 2015 | 12 pages
Authors
Zdravevski Eftim; Lameski Petre; Kulakov Andrea; Filiposka Sonja; Trajanov Dimitar; Jakimovski Boro
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords
Big Data; information dissemination; parallel processing; Amazon AWS cluster; FedCSIS AAIA'14 dataset; Hadoop; MapReduce; Pig Latin script; information gain; parallel computation; Entropy; Loading; Machine learning algorithms; Mathematical model; Servers; Writing; feature ranking; parallelization

Similar literature

1. High Performance Computation of Big Data: Performance Optimization Approach towards a Parallel Frequent Item Set Mining Algorithm for Transaction Data based on Hadoop MapReduce Framework [J]. Guru Prasad M S, Nagesh H R, Swathi Prabhu. International Journal of Intelligent Systems and Applications. 2017, No. 1
2. Data Encoding and Parallelization Porting Techniques to Transform Binary Data Formats to Hadoop/MapReduce [J]. NASA Tech Briefs. 2016, No. 5
3. Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering [J]. Zahid Ansari, Asif Afzal, Tanvir Habib Sardar. Journal of The Institution of Engineers (India): Series B. 2019, No. 2
4. Parallel computation of information gain using Hadoop and MapReduce [C]. Zdravevski Eftim, Lameski Petre, Kulakov Andrea. Federated Conference on Computer Science and Information Systems. 2015
5. Scalable parallel computing on clouds: Efficient and scalable architectures to perform pleasingly parallel, MapReduce and iterative data intensive computations on cloud environments [D]. Gunarathne, Thilina. 2014
6. MR-Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services [O]. Brian Pratt, J. Jeffry Howbert, Natalie I. Tasman
7. Parallel computation of information gain using Hadoop and MapReduce [O]. Eftim Zdravevski, Petre Lameski, Andrea Kulakov. 2015
