International Conference on High Performance Computing and Simulation

Impact of Vectorization and Multithreading on Performance and Energy Consumption on Jetson Boards


Abstract

ARM processors are well known for their energy efficiency and are consequently widely used in embedded platforms. Like other processor architectures, they are built with several levels of parallelism, from Instruction Level Parallelism (out-of-order and superscalar capabilities) to Thread Level Parallelism (multicore), to increase their performance. These processors are now also targeting the HPC domain and will equip the Fujitsu Post-K supercomputer. Some ARM processors from the Cortex-A series, which equip smartphones and tablets, also provide Data Level Parallelism through SIMD units called NEON. These units can process 128 bits of data at a time, for example four 32-bit floating-point values. Taking advantage of these units requires code vectorization, which may be performed automatically by the compiler or explicitly by using NEON intrinsics. Exploiting all these levels of parallelism may lead to better performance but also to higher energy consumption. This is not an issue in the HPC domain, where application development is driven by finding the best performance. Development for embedded applications, however, is driven by finding the best trade-off between energy consumption and performance. In this paper, we propose to study the impact of vectorization and multithreading on both performance and energy consumption on some Nvidia Jetson boards. Results show that, depending on the algorithm and on its implementation, vectorization may bring a speedup similar to that of a scalar OpenMP implementation, but with lower energy consumption. Combining vectorization and multithreading may come close to both the best performance level and the lowest energy consumption, though not when the cores run at their maximum frequencies.
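To make the two techniques named in the abstract concrete, here is a minimal sketch in C of a kernel that combines explicit NEON vectorization with OpenMP multithreading. It is not taken from the paper: the function name `add_f32` and the element-wise addition workload are hypothetical, chosen only for illustration. The NEON path uses the standard `arm_neon.h` intrinsics `vld1q_f32`, `vaddq_f32` and `vst1q_f32`, each operating on four 32-bit floats in one 128-bit register. On targets without NEON, the sketch falls back to a scalar loop so that it still compiles.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical kernel (not from the paper): c[i] = a[i] + b[i]. */
void add_f32(const float *a, const float *b, float *c, size_t n)
{
    /* Largest multiple of 4 that is <= n: the part handled 4 floats at a time. */
    size_t vec_end = n - (n % 4);

#if defined(__ARM_NEON)
    /* Each iteration processes one 128-bit chunk (four 32-bit floats).
       With OpenMP enabled (e.g. -fopenmp), chunks are spread across cores;
       without it, the pragma is ignored and the loop runs on one core. */
    #pragma omp parallel for
    for (size_t i = 0; i < vec_end; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);    /* load 4 floats from b */
        vst1q_f32(c + i, vaddq_f32(va, vb));  /* add lanes, store to c */
    }
#else
    /* Scalar fallback for targets without NEON (assumption: portability
       matters more than speed here). */
    #pragma omp parallel for
    for (size_t i = 0; i < vec_end; i++) {
        c[i] = a[i] + b[i];
    }
#endif

    /* Scalar tail for the last n % 4 elements. */
    for (size_t i = vec_end; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Whether the intrinsics version, the compiler's auto-vectorized output, or the multithreaded variant gives the best energy/performance trade-off is exactly what the paper measures. The sketch only shows what each option looks like in source code.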
