Journal: Big Data Analytics

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink


Abstract

The growing volume of data has created a need for new processing frameworks. The MapReduce model is a framework for processing and generating large-scale datasets with parallel and distributed algorithms. Apache Spark is a fast, general-purpose engine for large-scale data processing based on the MapReduce model; its main feature is in-memory computation. Recently, a novel framework called Apache Flink has emerged, focused on distributed stream and batch data processing. In this paper we perform a comparative study of the scalability of these two frameworks, using their corresponding Machine Learning libraries for batch data processing. Additionally, we analyze the performance of the two Machine Learning libraries that Spark currently has, MLlib and ML. The experiments use the same algorithms and the same dataset. Experimental results show that Spark MLlib has better performance and overall lower runtimes than Flink.
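
As a rough illustration of the MapReduce model that the abstract refers to, the single-process Python sketch below runs the three conceptual phases (map, shuffle, reduce) on a word-count task. It is not code from the paper and does not use the Spark or Flink APIs; every function name here is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle_phase(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's list of values into a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield (word, 1)


def run_word_count(lines):
    """Run map, shuffle, and reduce in sequence on in-memory data."""
    pairs = map_phase(lines, word_count_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, lambda a, b: a + b)


if __name__ == "__main__":
    sample = ["spark and flink", "spark batch processing"]
    print(run_word_count(sample))
```

In a distributed engine the map and reduce phases would run in parallel across partitions on many machines; this sketch keeps everything in one process so that only the data flow is visible.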
