Scheduling Shared Scans of Large Data Files

Abstract

We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files.

As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.
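
The intuition behind scheduling nonsharable scans first can be illustrated with a toy simulation. The sketch below is not the paper's policy family; it is a minimal model under stated assumptions: a single scanner, fixed scan times per file, and a scan that serves only the jobs already waiting when it starts. The function `simulate`, the job list, and the scan times are all hypothetical names and values chosen for illustration.

```python
def simulate(jobs, scan_time, order):
    """Run one scanner through `order`, a list of file names.

    Each scan serves every unfinished job for that file that had already
    arrived when the scan started (toy assumption: a scan in progress
    cannot be joined midway).
    """
    clock = 0.0
    finish = {}  # job index -> completion time
    scans = 0
    for name in order:
        batch = [i for i, (f, arrival) in enumerate(jobs)
                 if f == name and arrival <= clock and i not in finish]
        if not batch:
            continue  # nothing waiting for this file; skip the scan
        clock += scan_time[name]
        scans += 1
        for i in batch:
            finish[i] = clock
    responses = [finish[i] - jobs[i][1] for i in range(len(jobs))]
    return {"scans": scans, "makespan": clock,
            "mean_response": sum(responses) / len(responses)}


# Hypothetical workload: (file, arrival time). File A is sharable because
# a second job for it arrives shortly after the first; file B is not.
jobs = [("A", 0), ("B", 0), ("A", 1)]
scan_time = {"A": 10, "B": 10}

# Scan the sharable file immediately, then B, then A again for the late job.
sharable_first = simulate(jobs, scan_time, ["A", "B", "A"])

# Scan the nonsharable file first; both A jobs are waiting when A starts.
nonsharable_first = simulate(jobs, scan_time, ["B", "A"])

print(sharable_first)
print(nonsharable_first)
```

In this toy workload, scanning A first costs three scans, while scanning B first lets both A jobs share a single scan, so only two scans are needed and the mean response time drops as well. The benefit depends on the second A job actually arriving soon, which is why the abstract conditions the idea on a high expected arrival rate of sharable jobs.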
