Non-local Data Fetch Scheme Based on Delay Distribution for Hadoop Clusters in Public Cloud



Abstract

Hadoop and its ecosystem have become the de facto platform for processing large-scale data, also known as Big Data, because they hide the complexity of distributed computing, scheduling, and communication while providing fault tolerance. Most of Hadoop's features are designed for clusters hosted in on-premises data centers, where the cluster topology is known. As cloud computing becomes more popular and mature, more and more users deploy Hadoop clusters in public cloud environments. Hadoop depends on an administrator-configured rack assignment of servers to calculate the distance between servers. When fetching non-local data, Hadoop uses this distance to find the best remote server to fetch the data from. In public cloud environments, however, the rack assignment of virtual servers cannot be known, so Hadoop sometimes fetches data from a remote server on the other side of the data center. To overcome this problem, we propose a delay-distribution-based scheme that finds the closest server to fetch data from. The proposed scheme selects a server by comparing the delay distributions of server pairs. Each delay distribution is calculated by periodically measuring the round-trip time between servers. Our experiments show that the proposed scheme outperforms conventional Hadoop by nearly 12% in terms of non-local data fetch time. This reduction in data fetch time will lead to a reduction in job runtime, especially in real-world multi-user clusters where non-local data fetching can happen frequently.
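The selection step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how two delay distributions are compared, so this sketch assumes a simple rule: rank candidate servers by the median of their recent round-trip-time samples and break ties by the mean. The names `DelayTable`, `record`, `summary`, and `pick_replica` are hypothetical and are not taken from the paper or from Hadoop.

```python
import statistics
from collections import defaultdict, deque


class DelayTable:
    """Keeps a sliding window of round-trip-time samples per server pair."""

    def __init__(self, window=100):
        # Each (src, dst) pair keeps at most `window` of its latest samples.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, src, dst, rtt_ms):
        """Store one periodic RTT measurement from src to dst."""
        self._samples[(src, dst)].append(rtt_ms)

    def summary(self, src, dst):
        """Return (median, mean) of the stored delays, or None if unmeasured."""
        samples = self._samples.get((src, dst))
        if not samples:
            return None
        return statistics.median(samples), statistics.fmean(samples)


def pick_replica(table, reader, replicas):
    """Pick the replica holder whose delays to `reader` look smallest.

    Assumption (not from the paper): distributions are ranked by median RTT,
    with the mean as a tie-breaker. Unmeasured servers sort last.
    """
    def key(server):
        s = table.summary(reader, server)
        return (1, 0.0, 0.0) if s is None else (0, s[0], s[1])

    return min(replicas, key=key)


if __name__ == "__main__":
    table = DelayTable()
    for rtt in (0.42, 0.40, 0.45):
        table.record("node-a", "node-b", rtt)
    for rtt in (0.95, 1.10, 0.90):
        table.record("node-a", "node-c", rtt)
    # node-d has no measurements, so it is considered only as a last resort.
    print(pick_replica(table, "node-a", ["node-c", "node-b", "node-d"]))  # prints node-b
```

Using the median keeps a single delayed probe from changing the choice, and the bounded window lets the estimate follow changes in virtual machine placement. Both are design choices of this sketch, not claims about the authors' implementation.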


Bibliographic details

Source
IEEE International Conference on Big Data Security on Cloud; IEEE International Conference on High Performance and Smart Computing; IEEE International Conference on Intelligent Data and Security | 2018 | pp. 188-193 | 6 pages
Authors
Ravindra Sandaruwan Ranaweera; Eiji Oki; Nattapong Kitsuwan;
Original format: PDF
Keywords
Servers; Cloud computing; Delays; Task analysis; Topology; Standards organizations; Organizations;


Similar literature


1. Load Feedback-Based Resource Scheduling and Dynamic Migration-Based Data Locality for Virtual Hadoop Clusters in OpenStack-Based Clouds [J]. Dan Tao, Zhaowen Lin, Bingxu Wang. Tsinghua Science and Technology. 2017, No. 2
2. Load Feedback-Based Resource Scheduling and Dynamic Migration-Based Data Locality for Virtual Hadoop Clusters in OpenStack-Based Clouds [J]. Dan Tao, Zhaowen Lin, Bingxu Wang. 清华大学学报（英文版） (Tsinghua Science and Technology). 2017, No. 002
3. Cloud Based Gateway Clustering of Cloud Data Retrieval with GCD Recovery Scheme [J]. Padmakumari.P, Umamakeswari.A, Shanthi.P. International Journal of Engineering and Technology. 2013, No. 5
4. An Efficient Cloud-Based Revocable Identity-Based Proxy Re-encryption Scheme for Public Clouds Data Sharing [C]. Kaitai Liang, Joseph K. Liu, Duncan S. Wong. European Symposium on Research in Computer Security. 2014
5. Performance of Hadoop based Replica Exchange Molecular Dynamics on cloud computing [D]. Niu, Jin. 2013
6. CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment [O]. Jeongsu Oh, Chi-Hwan Choi, Min-Kyu Park
7. Public Cloud Storage for the Seismic Big Data Based on Amazon EC2 Cluster and Hadoop [O]. Jie Xiong, Song Zhang. 2017
