Allocation-oriented Algorithm Design with Application to GPU Computing.


Abstract

The wide data-parallelism of GPU processor design facilitates the execution of many concurrent, fine-grained tasks with unprecedented performance and efficiency. However, contemporary opinion regards GPU architecture as being poorly suited for many parallelizations that impose dynamic dependences between processing elements. This dissertation specifically attends to problems whose efficient solutions require cooperative allocation, i.e., their dependences stem from dynamic data placement within shared structures. The contribution of this thesis is a set of design patterns and reusable, tunable software primitives that foster the construction of cooperative, allocation-oriented parallelizations that are significantly faster, more efficient, more flexible, and more performance-portable than the state of the art.

Whereas concurrent CPU programs typically leverage fine-grained atomic operations for coordinating data placement, we demonstrate the advantages of parallel prefix sum as a high-performance, bulk-synchronous alternative for cooperative allocation. We construct very efficient algorithms for global and local prefix sum by employing flexible granularity-coarsening techniques for properly balancing serial and parallel workloads. The resulting efficiency gains permit kernel-fusion techniques whereby application-specific logic can be incorporated within our prefix sum constructs with little or no overhead.

To demonstrate the viability of our methods, we construct cooperative GPU implementations for a variety of parallel list-processing primitives including reduction, prefix scan, duplicate removal, histogram, and reduce-by-key. We evaluate their performance across a wide spectrum of problem sizes, types, and target architectures. To achieve performance-portability, we present a policy-based autotuning design idiom in which we leave implementation decisions having opaque performance consequences unbound within the program text.
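The core allocation idea the abstract describes can be illustrated in a few lines. The sketch below is not from the dissertation; it is a minimal serial model of how an exclusive prefix sum over per-thread output counts yields unique, data-independent write offsets, so that cooperative output (here, stream compaction) needs no fine-grained atomic increments. The function names `exclusive_scan` and `cooperative_compact` are illustrative, not the dissertation's API.

```python
def exclusive_scan(counts):
    """Exclusive prefix sum: offsets[i] = sum(counts[:i]).

    Returns the offsets and the total, which is exactly the amount of
    output storage the cooperating 'threads' must allocate."""
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets, running


def cooperative_compact(items, predicate):
    """Stream compaction via prefix-sum allocation: element i is kept
    iff predicate(items[i]); each kept element is written to the slot
    the scan assigned it, with no contention between writers."""
    counts = [1 if predicate(x) else 0 for x in items]
    offsets, total = exclusive_scan(counts)
    out = [None] * total
    for i, x in enumerate(items):
        if counts[i]:
            out[offsets[i]] = x  # placement is fixed before any write
    return out
```

For example, `cooperative_compact([3, 1, 4, 1, 5, 9], lambda x: x > 2)` yields `[3, 4, 5, 9]`. On a GPU the scan itself is parallel and the final scatter is one uncontended write per thread, which is the bulk-synchronous alternative to atomics that the abstract advocates.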
Finally, we showcase high performance implementations for two archetypal problems in computer science: sorting and breadth-first search (BFS). We achieve excellent performance for both genres, demonstrating multiple factors of speedup over prior parallelizations for GPU and conventional multi-core platforms.
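To connect the two themes, the following serial sketch (again illustrative, not the dissertation's implementation) shows how level-synchronous BFS is itself an allocation-oriented problem: before the next frontier can be gathered, the per-vertex neighbor counts are prefix-summed to size and position each vertex's output, mirroring how a GPU BFS sizes its output queue before scattering.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency-list graph.

    Each iteration first computes offsets (a prefix sum of adjacency-list
    lengths) so every frontier vertex knows where to write its neighbors,
    then scatters, then filters out already-visited vertices."""
    depth = [-1] * len(adj)
    depth[source] = 0
    frontier, level = [source], 0
    while frontier:
        # Prefix sum of neighbor counts = cooperative output allocation.
        offsets, total = [], 0
        for v in frontier:
            offsets.append(total)
            total += len(adj[v])
        gathered = [None] * total
        for i, v in enumerate(frontier):
            for j, w in enumerate(adj[v]):
                gathered[offsets[i] + j] = w  # uncontended writes
        level += 1
        frontier = [w for w in gathered if depth[w] == -1]
        for w in frontier:
            depth[w] = level
    return depth
```

On a square graph `adj = [[1, 2], [0, 3], [0, 3], [1, 2]]`, `bfs_levels(adj, 0)` returns `[0, 1, 1, 2]`. Note this simple filter may enqueue a vertex twice within one level; a real GPU implementation deduplicates, but the allocation pattern is the point here.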

Record details

  • Author: Merrill, Duane G., III
  • Institution: University of Virginia
  • Degree-granting institution: University of Virginia
  • Subjects: Computer Engineering; Computer Science
  • Degree: Ph.D.
  • Year: 2011
  • Pages: 188 p.
  • Format: PDF
  • Language: English
