Journal: International Journal of Parallel Programming

An Efficient Programming Skeleton for Clusters of Multi-Core Processors

Abstract

This paper proposes a divide-and-conquer skeleton that aids parallel system programmers by (1) reducing programming complexity, (2) shortening programming time, and (3) enhancing code efficiency. To do this, the skeleton exploits three mechanisms: (1) work stealing, (2) communication/computation overlapping, and (3) architectural awareness. With work stealing, when a processing core reaches a low-load condition, it fetches a new job from the waiting queue of another core. The second mechanism uses dedicated threads so that the skeleton can overlap computation with communication. The third mechanism takes the architectural parameters of the host system into account (e.g., L1 cache size, network bandwidth, and network latency) to match a divide-and-conquer problem to the skeleton as closely as possible. To evaluate the skeleton, three benchmarks (merge sort, fast Fourier transform, and standard matrix multiplication) are developed both with the proposed skeleton and with customized programming. Experiments are carried out in both simulation and real implementation environments: the resulting six codes are simulated using the COTSon simulator and also implemented on a real system of 28 dual-core processors. In simulation, the skeleton-based codes show an average speed-up of 12.6% over the customized codes; in the real environment, the speed-up is 9.6%. Furthermore, programming with the proposed skeleton is at least 70% faster than custom programming, and this difference grows as program size increases.
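The abstract does not show the skeleton's interface, so the following is only a minimal sequential sketch of the general divide-and-conquer skeleton idea, not the authors' implementation. The names `divide_and_conquer`, `is_base`, `solve_base`, `divide`, `combine`, and `GRAIN` are all hypothetical. The `GRAIN` cutoff merely stands in for the kind of architecture-dependent parameter the abstract mentions (for example, a base-case size chosen to fit the L1 cache); work stealing and communication/computation overlapping are not modeled here.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # problem type
S = TypeVar("S")  # solution type


def divide_and_conquer(
    problem: P,
    is_base: Callable[[P], bool],
    solve_base: Callable[[P], S],
    divide: Callable[[P], Sequence[P]],
    combine: Callable[[Sequence[S]], S],
) -> S:
    """Hypothetical sequential skeleton: the user supplies four functions,
    and the skeleton owns the recursion (and, in a parallel version, the
    scheduling of subproblems)."""
    if is_base(problem):
        return solve_base(problem)
    subproblems = divide(problem)
    solutions = [
        divide_and_conquer(p, is_base, solve_base, divide, combine)
        for p in subproblems
    ]
    return combine(solutions)


def merge(parts: Sequence[List[int]]) -> List[int]:
    """Merge two sorted lists into one sorted list."""
    left, right = parts
    out: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


# Assumed cutoff: below this size the problem is solved directly.
# A real system might derive it from cache size; here it is arbitrary.
GRAIN = 4


def merge_sort(data: Sequence[int]) -> List[int]:
    """Merge sort expressed as an instance of the skeleton."""
    return divide_and_conquer(
        list(data),
        is_base=lambda xs: len(xs) <= GRAIN,
        solve_base=sorted,
        divide=lambda xs: (xs[: len(xs) // 2], xs[len(xs) // 2:]),
        combine=merge,
    )
```

Calling `merge_sort([5, 3, 9, 1, 7, 2, 8, 6, 4, 0])` runs the recursion through the skeleton. The point of the design is that the problem-specific code is confined to the four callbacks, which is where a skeleton of this kind could reduce programming effort.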