ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Scale-Out vs Scale-Up: A Study of ARM-based SoCs on Server-Class Workloads



Abstract

ARM 64-bit processing has generated enthusiasm for developing ARM-based servers targeted at both data centers and supercomputers. In addition to server-class components and hardware advancements, the ARM software environment has grown substantially over the past decade. Major development ecosystems and libraries have been ported and optimized to run on ARM, making ARM suitable for server-class workloads. Available ARM SoCs follow two trends: mobile-class SoCs, which rely on heterogeneous integration of a mix of CPU cores, GPGPU streaming multiprocessors (SMs), and other accelerators; and server-class SoCs, which instead integrate a larger number of CPU cores and several IO accelerators but offer no GPGPU support. For scaling the number of processing cores, there are two different paradigms: mobile-class SoCs use a scale-out architecture, a cluster of simpler systems connected over a network, while server-class ARM SoCs use a scale-up solution, leveraging symmetric multiprocessing to pack a large number of cores onto a single chip. In this article, we present the ScaleSoC cluster, a scale-out solution built from mobile-class ARM SoCs. ScaleSoC leverages fast network connectivity and GPGPU acceleration to improve performance and energy efficiency over previous ARM scale-out clusters. To study both scaling paradigms, we consider a wide range of modern server-class parallel workloads, including latency-sensitive transactional workloads, MPI-based CPU and GPGPU-accelerated scientific applications, and emerging artificial intelligence workloads. We study in depth the performance and energy efficiency of ScaleSoC compared to server-class ARM SoCs and discrete GPGPUs.
We quantify the network overhead on the performance of ScaleSoC and show that packing a large number of ARM cores on a single chip does not necessarily guarantee better performance, because shared resources, such as the last-level cache, become performance bottlenecks. We characterize the GPGPU-accelerated workloads and demonstrate that for applications that can leverage the better CPU-GPGPU balance of the ScaleSoC cluster, performance and energy efficiency improve compared to discrete GPGPUs.
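The trade-off described above can be illustrated with a toy Amdahl-style strong-scaling model (a sketch for intuition only, not the article's methodology; all parameter values are illustrative assumptions): scale-out pays a per-node network-communication overhead, while scale-up pays a per-core penalty from contention on shared resources such as the last-level cache.

```python
# Toy strong-scaling model contrasting the two paradigms in the abstract.
# serial_frac, net_overhead, and llc_contention are illustrative
# assumptions, not measurements from the article.

def scale_out_speedup(nodes, serial_frac=0.05, net_overhead=0.02):
    """Amdahl-style speedup where each added node adds a fixed network
    overhead, expressed as a fraction of single-node runtime."""
    parallel_frac = 1.0 - serial_frac
    return 1.0 / (serial_frac + parallel_frac / nodes
                  + net_overhead * (nodes - 1))

def scale_up_speedup(cores, serial_frac=0.05, llc_contention=0.015):
    """Same model, but the overhead grows with on-chip cores contending
    for shared resources such as the last-level cache."""
    parallel_frac = 1.0 - serial_frac
    return 1.0 / (serial_frac + parallel_frac / cores
                  + llc_contention * (cores - 1))

if __name__ == "__main__":
    for n in (1, 4, 16, 32):
        print(f"{n:2d} nodes/cores: scale-out {scale_out_speedup(n):.2f}x, "
              f"scale-up {scale_up_speedup(n):.2f}x")
```

Under either model, speedup peaks and then degrades as the overhead term grows with core or node count, which mirrors the abstract's observation that more cores on one chip do not necessarily mean better performance.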
