Conference: International Conference on Computer and Communication Engineering

SMBSP: A Self-Tuning Approach using Machine Learning to Improve Performance of Spark in Big Data Processing

Abstract

Apache Spark is an open-source distributed platform, widely known for its big data processing capability, that uses distributed memory to process big data efficiently. Obtaining the best performance from Spark remains a major challenge, because its performance depends to a large extent on a large number of configuration parameters. Spark has over 180 parameters that control system performance. Each parameter has a default value that lies within a range, and users can manually select a suitable value for each one. An improper choice of parameter values leads to poor performance. Manually tuning the parameters of a Hadoop-Spark system requires in-depth knowledge of the system. Because the parameter space is large, manual tuning is very time-consuming and inefficient, and the parameters may need to be retuned for each application. This paper proposes and develops an effective self-tuning approach, named SMBSP, based on an Artificial Neural Network (ANN), to avoid the drawbacks of manual parameter tuning. The approach was implemented on a Dell PowerEdge R720 server using datasets of 5 different sizes. It is found to speed up the Spark system by 35% on average compared with the default parameter configuration.
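
The abstract does not state the network architecture, the input features, or which parameters SMBSP tunes, so the following is only a minimal sketch of the general idea of ANN-based tuning, not the authors' method. It assumes a hypothetical search space over three real Spark properties, replaces measured runtimes with a made-up `synthetic_runtime` function, and uses a small hand-written one-hidden-layer network (`TinyMLP`, also hypothetical) to predict runtime and recommend the configuration with the lowest prediction.

```python
import itertools
import math
import random

# Real Spark property names; the candidate values are illustrative assumptions.
SEARCH_SPACE = {
    "spark.executor.cores": [1, 2, 4, 8],
    "spark.executor.memory": [1, 2, 4, 8],  # in GiB, i.e. "1g" ... "8g"
    "spark.sql.shuffle.partitions": [50, 100, 200, 400],
}


def synthetic_runtime(cfg):
    """Made-up stand-in for a measured job runtime in seconds (NOT real data)."""
    cores, mem_gb, partitions = cfg
    return 10.0 + 100.0 / cores + 40.0 / mem_gb + abs(partitions - 200) / 20.0


def encode(cfg):
    """Scale each feature of a configuration to roughly [0, 1]."""
    cores, mem_gb, partitions = cfg
    return [math.log2(cores) / 3.0, math.log2(mem_gb) / 3.0, partitions / 400.0]


class TinyMLP:
    """One tanh hidden layer and a linear output, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        return h, sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def train_step(self, x, y, lr):
        h, pred = self.forward(x)
        err = pred - y  # gradient of 0.5 * (pred - y)^2 with respect to pred
        for j, hj in enumerate(h):
            # Backpropagate through tanh using the pre-update output weight.
            grad_h = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
        self.b2 -= lr * err


def mse(model, samples):
    return sum((model.forward(x)[1] - y) ** 2 for x, y in samples) / len(samples)


rng = random.Random(0)
candidates = list(itertools.product(*SEARCH_SPACE.values()))

# Pretend only 40 of the 64 configurations were actually benchmarked.
measured = rng.sample(candidates, 40)
train = [(encode(c), synthetic_runtime(c) / 100.0) for c in measured]

model = TinyMLP(n_in=3, n_hidden=8, rng=rng)
initial_mse = mse(model, train)
for _ in range(300):
    rng.shuffle(train)
    for x, y in train:
        model.train_step(x, y, lr=0.05)
final_mse = mse(model, train)

# Recommend the candidate with the lowest predicted runtime.
best = min(candidates, key=lambda c: model.forward(encode(c))[1])
recommended = dict(zip(SEARCH_SPACE, best))
print(recommended, round(final_mse, 4))
```

In a real setting the synthetic function would be replaced by runtimes measured on the cluster for each benchmarked configuration, and the recommended values would be passed to Spark through its normal configuration mechanisms.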
