A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

Diego García-Gil; Sergio Ramírez-Gallego; Salvador García; Francisco Herrera

Abstract

The growing volume of data has created a need for new processing frameworks. The MapReduce model is a framework for processing and generating large-scale datasets with parallel and distributed algorithms. Apache Spark is a fast and general engine for large-scale data processing based on the MapReduce model. The main feature of Spark is in-memory computation. Recently, a novel framework called Apache Flink has emerged, focused on distributed stream and batch data processing. In this paper we perform a comparative study on the scalability of these two frameworks using the corresponding Machine Learning libraries for batch data processing. Additionally, we analyze the performance of the two Machine Learning libraries that Spark currently has, MLlib and ML. For the experiments, the same algorithms and the same dataset are used. Experimental results show that Spark MLlib has better performance and overall lower runtimes than Flink.

Bibliographic record

Source: Big Data Analytics, 2017, Issue 1
Authors: Diego García-Gil; Sergio Ramírez-Gallego; Salvador García; Francisco Herrera
Original format: PDF
CLC classification: Medical research methods

