Journal: Concurrency and Computation: Practice and Experience
Vectorizing unstructured mesh computations for many-core architectures

Abstract

Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon-Phi on a key class of irregular applications: unstructured mesh computations. Using single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon-Phi, using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon-Phi does not provide good performance for these applications but is still comparable with a pair of mid-range Xeon chips. Copyright © 2015 John Wiley & Sons, Ltd.
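
The abstract mentions organizing irregular memory accesses and race conditions so that an unstructured mesh loop can be vectorized. One common way to do this is to color the mesh edges so that no two edges of the same color touch the same node. The sketch below illustrates that idea only; it is not OP2 code and not necessarily the scheme the paper uses. The function names, the toy mesh, and the "add an edge weight to both endpoints" kernel are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT OP2 code and not the paper's implementation.
# Hypothetical toy mesh: 4 nodes, 5 edges (a square plus one diagonal).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
WEIGHTS = [1.0, 2.0, 3.0, 4.0, 5.0]  # one value per edge (assumed kernel input)
NUM_NODES = 4


def color_edges(edges):
    """Greedily give each edge the lowest color unused at both of its nodes."""
    used = {}  # node -> set of colors already taken by edges touching it
    colors = []
    for a, b in edges:
        taken = used.setdefault(a, set()) | used.setdefault(b, set())
        c = 0
        while c in taken:
            c += 1
        colors.append(c)
        used[a].add(c)
        used[b].add(c)
    return colors


def serial_sweep(edges, weights, num_nodes):
    """Reference loop: each edge increments both endpoint nodes, one at a time."""
    res = [0.0] * num_nodes
    for (a, b), w in zip(edges, weights):
        res[a] += w
        res[b] += w
    return res


def colored_sweep(edges, weights, num_nodes, colors):
    """Same computation, processed one color at a time."""
    res = [0.0] * num_nodes
    for c in range(max(colors) + 1):
        batch = [i for i, ci in enumerate(colors) if ci == c]
        # Edges in `batch` share no node, so on real hardware these increments
        # could be issued together (vector lanes or threads) without a race.
        for i in batch:
            a, b = edges[i]
            res[a] += weights[i]
            res[b] += weights[i]
    return res


if __name__ == "__main__":
    colors = color_edges(EDGES)
    print(colors)
    print(colored_sweep(EDGES, WEIGHTS, NUM_NODES, colors))
```

The inner loop here is still serial Python; the point is only that, within one color, the indirect writes are independent. That independence is the property a vectorized scatter or a multithreaded loop needs.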