
Minimization of resource consumption through workload consolidation in large-scale distributed data platforms.


Abstract

The rapid increase in data volumes across many application domains has led in recent years to widespread adoption of parallel and distributed data management systems such as parallel databases and MapReduce-based frameworks (e.g., Hadoop). Use of such frameworks is expected to accelerate in the coming years, putting further strain on already-scarce resources like compute power, network bandwidth, and energy. To reduce total execution times, there is a trend towards increasing execution parallelism by spreading data across a large number of machines. However, this often significantly increases total resource consumption, especially energy consumption, because of process startup costs and other overheads (e.g., communication overheads). In this dissertation, we develop several data management techniques to minimize resource consumption through workload consolidation.

We introduce a key metric called query span, i.e., the number of machines involved in the execution of a query or a job. To minimize per-query resource consumption, we propose to minimize query span. To that end, we develop several workload-driven data partitioning and replica selection algorithms that attempt to minimize the average query span by exploiting the fact that most distributed environments need to use replication for fault tolerance. Extensive experiments on various datasets show that judicious data placement and replication can dramatically reduce average query spans, resulting in significant reductions in resource consumption. We show our results primarily on two applications: a distributed data warehouse system and distributed information retrieval. In the first case, we show that minimizing average query spans can minimize overall resource consumption for a given workload and can also improve the performance of complex analytical queries. In the second case, our approach minimizes the overall search cost and effectively trades off search cost against load imbalance.

The best case of resource efficiency for any underlying data processing system is achieved when the job or the query can be run efficiently on a single machine (i.e., query span = 1). In the final part of the dissertation, we discuss an in-memory MapReduce system optimized for performing complex analytics tasks on input data sizes that fit in a single machine's memory. We argue that systems like Hadoop, which are designed to operate across a large number of machines, do not deliver optimal performance for small and medium-sized complex analytics tasks because of high startup costs, heavy disk activity, and wasteful checkpointing. We have developed a prototype runtime called HONE that is API-compatible with standard (distributed) Hadoop. In other words, we can take existing Hadoop code and run it, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes.

Overall, our key contributions include the identification of query span as a key metric and of its relationship with overall resource consumption in scale-out architectures. We introduce several workload-aware techniques to optimize this metric, and we demonstrate the effectiveness of query span minimization in different application scenarios. To take advantage of scale-up architectures effectively, we develop HONE, a novel in-memory MapReduce system for a single machine. Our thorough experiments on real and synthetic datasets demonstrate the efficacy of our proposed approaches.
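
To make the query span metric concrete, the following is a minimal Python sketch, not the dissertation's algorithm. It assumes a hypothetical `placement` map from machine name to the set of data items stored there, and it picks replicas with a standard greedy set-cover heuristic, so the span it reports is an upper bound on the true minimum. The names `query_span` and `average_query_span` are illustrative only.

```python
from typing import Dict, List, Set


def query_span(query_items: Set[str], placement: Dict[str, Set[str]]) -> Set[str]:
    """Return a set of machines that together hold every item the query needs.

    Greedy set cover (an assumed stand-in heuristic): repeatedly pick the
    machine that holds the most still-uncovered items.
    """
    remaining = set(query_items)
    chosen: Set[str] = set()
    while remaining:
        # Iterate machines in sorted order so ties are broken deterministically.
        best = max(sorted(placement), key=lambda m: len(placement[m] & remaining))
        covered = placement[best] & remaining
        if not covered:
            raise ValueError(f"no replica holds: {sorted(remaining)}")
        chosen.add(best)
        remaining -= covered
    return chosen


def average_query_span(workload: List[Set[str]], placement: Dict[str, Set[str]]) -> float:
    """Average number of machines touched per query in the workload."""
    return sum(len(query_span(q, placement)) for q in workload) / len(workload)


# Hypothetical workload and two placements: one without and one with replication.
workload = [{"a", "c"}, {"b"}]
no_replication = {"m1": {"a", "b"}, "m2": {"c"}}
with_replication = {"m1": {"a", "b"}, "m2": {"c", "a"}}

print(average_query_span(workload, no_replication))    # 1.5
print(average_query_span(workload, with_replication))  # 1.0
```

In this toy case, adding a second replica of item `a` on `m2` lets the query `{"a", "c"}` run on one machine instead of two, which is the kind of effect the abstract attributes to workload-aware replica placement.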

Bibliographic record

  • Author

    Kayyoor, Ashwin Kumar

  • Author affiliation

    University of Maryland, College Park

  • Degree-granting institution: University of Maryland, College Park
  • Subject: Computer science
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 205 p.
  • Total pages: 205
  • Original format: PDF
  • Language of text: eng