The VLDB Journal

On the optimization of schedules for MapReduce workloads in the presence of shared scans



Abstract

We consider MapReduce clusters designed to support multiple concurrent jobs, concentrating on environments in which the number of distinct datasets is modest relative to the number of jobs. In such scenarios, many individual datasets are likely to be scanned concurrently by multiple Map phase jobs. As has been noticed previously, this scenario provides an opportunity for Map phase jobs to cooperate, sharing the scans of these datasets and thus reducing the costs of such scans. Our paper has three main contributions over previous work. First, we present a novel and highly general method for sharing scans and thus amortizing their costs. This concept, which we call cyclic piggybacking, has a number of advantages over the more traditional batching scheme described in the literature. Second, we notice that the various subjobs generated in this manner can be assumed, in an optimal schedule, to respect a natural chain precedence ordering. Third, we describe a significant but natural generalization of the recently introduced FLEX scheduler for optimizing schedules within the context of this cyclic piggybacking paradigm, which can be tailored to a variety of cost metrics. Such cost metrics include average response time, average stretch, and any minimax-type metric, for a total of 11 separate, standard metrics. Moreover, most of this carries over to the more general case of overlapping rather than identical datasets, employing what we will call semi-shared scans. In such scenarios, chain precedence is replaced by arbitrary precedence, but we can still handle 8 of the original 11 metrics. The overall approach, including both cyclic piggybacking and the FLEX scheduling generalization, is called CIRCUMFLEX. We describe some practical implementation strategies, and we evaluate the performance of CIRCUMFLEX via a variety of simulation and real benchmark experiments.
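The abstract names cyclic piggybacking but does not spell out its mechanics. The following is only a toy sketch of one plausible reading: a dataset is scanned circularly, a job that arrives mid-scan joins at the current block, and it completes once the scan has wrapped around to where it joined. The names `Job` and `cyclic_shared_scan`, the one-block-per-time-step model, and the example arrival times are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    arrival: int                       # time step at which the job arrives
    start_block: Optional[int] = None  # block at which it joined the scan
    seen: int = 0                      # number of blocks consumed so far
    finish: Optional[int] = None       # time step at which it has seen every block


def cyclic_shared_scan(num_blocks: int, jobs: List[Job]) -> int:
    """Simulate one circular scan of a dataset shared by all active jobs.

    Assumed toy model: one block is read per time step, and every
    active job consumes that block. Returns the number of block reads.
    """
    pending = sorted(jobs, key=lambda j: j.arrival)
    active: List[Job] = []
    t = 0      # current time step
    pos = 0    # index of the next block to read
    reads = 0  # physical block reads performed

    while pending or active:
        # Admit every job that has arrived; it joins at the current block.
        while pending and pending[0].arrival <= t:
            job = pending.pop(0)
            job.start_block = pos
            active.append(job)

        if not active:
            # Nothing to scan for: skip ahead to the next arrival.
            t = pending[0].arrival
            continue

        # Read one block once and share it with all active jobs.
        reads += 1
        for job in active:
            job.seen += 1
        t += 1
        pos = (pos + 1) % num_blocks

        # A job is done once it has seen every block exactly once.
        for job in active:
            if job.seen == num_blocks:
                job.finish = t
        active = [j for j in active if j.finish is None]

    return reads


jobs = [Job("A", arrival=0), Job("B", arrival=2)]
reads = cyclic_shared_scan(num_blocks=4, jobs=jobs)
print(reads, [(j.name, j.start_block, j.finish) for j in jobs])
```

In this example the two jobs together cost 6 block reads, where two independent scans of the 4-block dataset would cost 8. Job B does not have to wait for a new batch to start; it joins at block 2 and picks up blocks 0 and 1 after the scan wraps around.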
