On the usability of Hadoop MapReduce, Apache Spark & Apache Flink for data science


Abstract

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of more advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These platforms not only aim to improve performance through better in-memory processing, but in particular provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms (Apache Hadoop MapReduce, Apache Spark, and Apache Flink) from a usability perspective. We report on the design, execution and results of a usability study with a cohort of master's students, who were learning and working with all three platforms in order to solve different use cases set in a data science context. Our findings show that Spark and Flink are preferred over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink as platforms for batch-oriented big data analysis. This study starts an exploration of the factors that make Big Data platforms more or less effective for users in data science.


Bibliographic details

Source: IEEE International Conference on Big Data, 2017, pp. 303-310 (8 pages)
Authors: Bilal Akil; Ying Zhou; Uwe Röhm
Original format: PDF
Keywords: Usability; Sparks; Big Data; Cloud computing; Programming


