Conference: International Conference on Systems and Informatics

Octopus: Based on Congestion-aware Scheduling on Geo-distributed Big Data Analytics Cluster



Abstract

In recent years, big data analytics frameworks have sprung up rapidly. Meanwhile, it has become routine for large volumes of data to be generated, stored, and processed across geographically distributed datacenters. In a geo-distributed environment, the network congestion caused by data transfers between networks becomes a major bottleneck for overall system performance. Many existing methods handle network congestion only after it occurs, which does not solve the problem fundamentally. In this paper, we focus on predicting and avoiding network congestion in advance in a geo-distributed environment on Apache Spark, measured in terms of job completion time. We formulate this as a runtime minimization problem, which is challenging to solve in practice because the setting spans multiple data centers. To address these challenges, we propose a model based on congestion-aware scheduling. In the model, we exploit SDN (Software-Defined Networking) to detect in advance the size of the data flows coming from different data centers and then analyze their characteristics to predict which flows are likely to cause network congestion, so that we can prepare two schemes for the different kinds of flow. In addition, when we detect network congestion, we choose a path with greater bandwidth for the congested flow. The approach can minimize network congestion, raise network utilization, and improve system performance in a geo-distributed environment. As a highlight of this paper, we design and implement our proposed solution as a job scheduler based on Apache Spark, a modern data processing framework.
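
The abstract does not give the scheduler's actual rules, so the following is only a minimal sketch of the general idea it describes: use a flow size known in advance to flag flows likely to congest a link, and route flagged flows over the path with the most bandwidth. The names `Path`, `Flow`, `is_congestion_prone`, `choose_path`, and the `max_seconds` threshold are all hypothetical and are not taken from the paper or from any Spark or SDN API.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    available_mbps: float  # assumed: free bandwidth reported by the SDN controller


@dataclass
class Flow:
    flow_id: str
    size_mb: float  # assumed: flow size known before the transfer starts


def is_congestion_prone(flow: Flow, path: Path, max_seconds: float) -> bool:
    """Flag a flow whose estimated transfer time on `path` exceeds the threshold."""
    if path.available_mbps <= 0:
        return True  # no free bandwidth: treat any flow as congestion-prone
    estimated_seconds = flow.size_mb * 8 / path.available_mbps  # MB -> megabits
    return estimated_seconds > max_seconds


def choose_path(flow: Flow, default: Path, alternatives: list[Path],
                max_seconds: float = 10.0) -> Path:
    """Keep ordinary flows on the default path; move flagged flows to the widest path."""
    if not is_congestion_prone(flow, default, max_seconds):
        return default
    return max([default, *alternatives], key=lambda p: p.available_mbps)


default_path = Path("dc1-dc2-direct", available_mbps=100.0)
other_paths = [Path("dc1-dc3-dc2", 400.0), Path("dc1-dc4-dc2", 250.0)]

small_choice = choose_path(Flow("shuffle-a", size_mb=10.0), default_path, other_paths)
large_choice = choose_path(Flow("shuffle-b", size_mb=1000.0), default_path, other_paths)
print(small_choice.name, large_choice.name)
```

In this sketch the two "schemes" are simply "stay on the default path" and "reroute to the widest path". A real scheduler would also have to account for bandwidth consumed by flows it has already placed, which this example ignores.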
