Non-local Data Fetch Scheme Based on Delay Distribution for Hadoop Clusters in Public Cloud

Abstract

Hadoop and its ecosystem have become the de facto platform for processing large-scale data, also known as Big Data, because they hide the complexity of distributed computing, scheduling, and communication while providing fault tolerance. Most of Hadoop's features are designed for clusters hosted in on-premises data centers, where the cluster topology is known. As cloud computing becomes more popular and mature, more and more users deploy Hadoop clusters in public cloud environments. Hadoop depends on the administrator-configured rack assignment of servers to calculate the distance between servers. When fetching non-local data, Hadoop uses this distance to find the best remote server to fetch the data from. In public cloud environments, however, it is impossible to know the rack assignment of virtual servers, so Hadoop sometimes fetches data from a remote server on the other side of the data center. To overcome this problem, we propose a delay-distribution-based scheme for finding the closest server to fetch data from. The proposed scheme selects a server by comparing the delay distributions between server pairs. The delay distribution is calculated by periodically measuring the round-trip time between servers. Our experiments show that the proposed scheme outperforms conventional Hadoop by nearly 12% in terms of non-local data fetch time. This reduction in data fetch time will lead to a reduction in job runtime, especially in real-world multi-user clusters where non-local data fetching can happen frequently.
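The abstract does not say how the delay distributions are compared, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that round-trip times are sampled elsewhere and fed in, that each server's distribution is summarized by its median with the 90th percentile as a tie-breaker, and that a fixed-size sliding window keeps the samples current. The class name `DelayTracker` and its methods are hypothetical.

```python
from collections import defaultdict, deque
from statistics import median


class DelayTracker:
    """Hypothetical helper: keeps recent RTT samples per remote server."""

    def __init__(self, window=100):
        # Assumption: a sliding window of the most recent samples stands in
        # for the "delay distribution" of each server pair.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, server, rtt_ms):
        """Store one periodically measured round-trip time (milliseconds)."""
        self._samples[server].append(rtt_ms)

    def summary(self, server):
        """Return (median, 90th percentile) or None if nothing was measured."""
        samples = self._samples.get(server)
        if not samples:
            return None
        ordered = sorted(samples)
        p90 = ordered[min(len(ordered) - 1, int(0.9 * len(ordered)))]
        return (median(ordered), p90)

    def closest(self, candidates):
        """Pick the candidate whose delay summary is smallest.

        Assumption: servers are ranked by median RTT first and by the
        90th percentile second; unmeasured servers are skipped.
        """
        scored = []
        for server in candidates:
            score = self.summary(server)
            if score is not None:
                scored.append((score, server))
        if not scored:
            return None
        return min(scored)[1]


tracker = DelayTracker(window=5)
for rtt in [0.30, 0.35, 0.32, 0.31, 2.00]:  # one outlier does not move the median
    tracker.record("node-a", rtt)
for rtt in [0.90, 0.95, 1.00, 0.92, 0.91]:
    tracker.record("node-b", rtt)

# "node-c" holds a replica but has no measurements yet, so it is skipped.
best = tracker.closest(["node-a", "node-b", "node-c"])
```

Ranking by a robust statistic such as the median, rather than by the most recent sample, keeps a single delayed probe from changing the choice of server. Any other way of comparing two distributions could be substituted in `closest`.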