Journal of Supercomputing

Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems



Abstract

To harness the compute resources of many-core systems with tens to hundreds of cores, applications have to expose parallelism to the hardware. Researchers are aggressively looking for program execution models that make it easier to expose parallelism and use the available resources. One common approach is to decompose a program into parallel 'tasks' and allow an underlying system layer to schedule these tasks to different threads. Software-only schedulers can implement various scheduling policies and algorithms that match the characteristics of different applications and programming models. Unfortunately, on large-scale multi-core systems, software schedulers suffer significant overheads as they synchronize and communicate task information over deep cache hierarchies. To reduce these overheads, hardware-only schedulers like Carbon have been proposed that perform task queuing and scheduling in hardware. This paper presents a hardware scheduling approach in which the structure that task-based programming models impose on programs is incorporated into the scheduler, making it aware of each task's data requirements. This prior knowledge of a task's data requirements allows for better task placement by the scheduler, which results in a reduction in overall cache misses and memory traffic, improving the program's performance and power efficiency. Simulations of this technique for a range of synthetic benchmarks and components of real applications have shown a reduction in the number of cache misses by up to 72% and 95% for the L1 and L2 caches, respectively, and up to 30% improvement in overall execution time compared with FIFO scheduling. This results not only in faster execution and up to 50% less data transfer, reducing load on the interconnect, but also in lower power consumption.
