Conference: International Conference on High Performance Computing Simulation

Impact of Vectorization and Multithreading on Performance and Energy Consumption on Jetson Boards

Abstract

ARM processors are well known for their energy efficiency and are consequently widely used in embedded platforms. Like other processor architectures, they are built with several levels of parallelism, from Instruction Level Parallelism (out-of-order and superscalar capabilities) to Thread Level Parallelism (multicore), to increase their performance. These processors now also target the HPC domain and will equip the Fujitsu Post-K supercomputer. Some ARM processors from the Cortex-A series, which are found in smartphones and tablets, also provide Data Level Parallelism through SIMD units called NEON. These units can process 128 bits of data at a time, for example four 32-bit floating-point values. Taking advantage of these units requires code vectorization, which may be performed automatically by the compiler or explicitly by using NEON intrinsics. Exploiting all these levels of parallelism may lead to better performance but also to higher energy consumption. This is not an issue in the HPC domain, where application development is driven by the search for the best performance. Development for embedded applications, however, is driven by the search for the best trade-off between energy consumption and performance. In this paper, we study the impact of vectorization and multithreading on both performance and energy consumption on some Nvidia Jetson boards. Results show that, depending on the algorithm and its implementation, vectorization may bring a speedup similar to that of a scalar OpenMP implementation, but with lower energy consumption. Combining vectorization and multithreading may come close to both the best performance and the lowest energy consumption, but not when the cores run at their maximum frequencies.
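To make the three code variants named in the abstract concrete, here is a minimal sketch in C. It is not taken from the paper: the SAXPY kernel (y = a*x + y), the function names, and the `CHUNK` size are all assumptions chosen for illustration. It shows a scalar loop, an explicit NEON version that handles four 32-bit floats per 128-bit register, and an OpenMP wrapper that spreads NEON work across cores. On targets without NEON the intrinsics path is skipped and the scalar loop runs instead, so the file still compiles elsewhere.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical chunk size: number of floats handed to one loop iteration. */
#define CHUNK 1024

/* Scalar reference: y[i] = a * x[i] + y[i]. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Explicit vectorization with NEON intrinsics (assumed kernel, not the
 * paper's): each 128-bit register holds four 32-bit floats. */
void saxpy_simd(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t va = vdupq_n_f32(a);          /* broadcast a to 4 lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);    /* load 4 floats from x */
        float32x4_t vy = vld1q_f32(y + i);    /* load 4 floats from y */
        vy = vmlaq_f32(vy, va, vx);           /* vy + va * vx, per lane */
        vst1q_f32(y + i, vy);                 /* store 4 results */
    }
#endif
    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Multithreading combined with vectorization: OpenMP distributes chunks
 * over threads and each chunk is processed by the SIMD routine. Without
 * OpenMP support enabled, the pragma is ignored and the loop runs serially. */
void saxpy_omp_simd(size_t n, float a, const float *x, float *y)
{
    long nchunks = (long)((n + CHUNK - 1) / CHUNK);
    #pragma omp parallel for
    for (long c = 0; c < nchunks; ++c) {
        size_t start = (size_t)c * CHUNK;
        size_t len = (n - start < CHUNK) ? n - start : CHUNK;
        saxpy_simd(len, a, x + start, y + start);
    }
}
```

With GCC, OpenMP is enabled by `-fopenmp`; all three functions take the same arguments, so they can be swapped in a benchmark loop to compare run time and measured energy for each variant.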
