IEEE International Symposium on Computer Architecture and High Performance Computing

Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC

Abstract

Recently, there has been rapidly growing demand for faster machine learning (ML) processing in data centers and for migrating ML inference applications to edge devices. These developments have prompted both industry and academia to explore custom accelerators that optimize ML execution for performance and power. However, identifying which accelerator is best equipped for a particular ML task is challenging, especially given the growing range of ML tasks, the number of target environments, and the limited number of integrated modeling tools. To tackle this issue, it is of paramount importance to provide the computer architecture research community with a common framework capable of performing a comprehensive, uniform, and fair comparison across different accelerator designs targeting a particular ML task. To this end, we propose a new framework named TFLITE-SOC (System On Chip) that integrates a lightweight system modeling library (SystemC), used for fast design space exploration of custom ML accelerators, into the build/execution environment of TensorFlow Lite (TFLite), a widely used framework for ML inference. With this approach, new accelerators developed in SystemC can be modeled and evaluated by leveraging the language's hierarchical design capabilities, resulting in faster design prototyping. Furthermore, any accelerator designed with TFLITE-SOC can be benchmarked for inference with any TFLite-compatible DNN model, which enables end-to-end DNN processing and detailed (i.e., per-layer) performance analysis. In addition to rapid prototyping, integrated benchmarking, and a range of platform configurations, TFLITE-SOC offers comprehensive analysis of accelerator occupancy and execution time breakdown, as well as a rich set of modules that new accelerators can use to implement scale-up studies and optimized memory transfer protocols. We present our framework and demonstrate its utility by exploring the design space of a TPU-like systolic array and describing possible directions for optimization. Using a compression technique, we implement an optimization that reduces memory traffic between DRAM and on-device buffers. Compared to the baseline accelerator, our optimized design shows up to 1.26x speedup on accelerated operations and up to 1.19x speedup on end-to-end DNN execution.
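To give a flavor of the kind of model the SystemC layer enables, below is a minimal, self-contained sketch of a single multiply-accumulate processing element (PE) of a TPU-like systolic array. The module, port, and signal names are illustrative assumptions for this sketch, not identifiers from the TFLITE-SOC codebase; a full array model would instantiate a grid of such PEs and wire neighboring ports together hierarchically.

```cpp
// Hypothetical sketch (not from the TFLITE-SOC codebase): one MAC
// processing element of a TPU-like systolic array, written as a
// SystemC module of the kind a hierarchical array model composes.
#include <systemc.h>

SC_MODULE(MacPE) {
    sc_in<bool>        clk;
    sc_in<sc_int<8>>   a_in;    // activation, flows left-to-right
    sc_in<sc_int<8>>   w_in;    // weight, flows top-to-bottom
    sc_out<sc_int<8>>  a_out;   // forwarded to the right neighbor
    sc_out<sc_int<8>>  w_out;   // forwarded to the neighbor below
    sc_out<sc_int<32>> acc_out; // partial sum, kept in the PE

    sc_int<32> acc;

    void step() {
        acc += a_in.read() * w_in.read(); // multiply-accumulate
        a_out.write(a_in.read());         // pass operands along
        w_out.write(w_in.read());
        acc_out.write(acc);
    }

    SC_CTOR(MacPE) : acc(0) {
        SC_METHOD(step);
        sensitive << clk.pos();           // one MAC per clock edge
    }
};

int sc_main(int, char*[]) {
    sc_clock clk("clk", 1, SC_NS);
    sc_signal<sc_int<8>>  a_in, w_in, a_out, w_out;
    sc_signal<sc_int<32>> acc;

    MacPE pe("pe");
    pe.clk(clk);
    pe.a_in(a_in);   pe.w_in(w_in);
    pe.a_out(a_out); pe.w_out(w_out);
    pe.acc_out(acc);

    a_in.write(3);   // constant stimulus: accumulate 3*2 per cycle
    w_in.write(2);
    sc_start(4, SC_NS);
    // expect 3*2 accumulated once per posedge (24 with edges at 0..3 ns)
    std::cout << "partial sum after 4 ns: " << acc.read() << std::endl;
    return 0;
}
```

An output-stationary organization is assumed here (partial sums stay in the PE while operands flow through); the paper's accelerator may use a different dataflow.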
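The relationship between the two reported speedups admits a quick Amdahl's-law sanity check. Assuming, purely for illustration, that the 1.26x gain applies uniformly to the accelerated operations and that all other processing is unchanged, the fraction f of baseline end-to-end time spent in accelerated operations can be backed out as follows (an inference from the reported numbers, not a figure stated in the abstract):

```latex
% Amdahl's law: end-to-end speedup S, given fraction f of baseline
% time in the accelerated operations and local speedup s on them:
%   S = 1 / ((1 - f) + f/s)
% Solving for f with S = 1.19 (end-to-end) and s = 1.26 (accelerated ops):
\[
  f \;=\; \frac{1 - 1/S}{1 - 1/s}
    \;=\; \frac{1 - 1/1.19}{1 - 1/1.26}
    \;\approx\; 0.77
\]
```

That is, under this assumption roughly three quarters of the baseline end-to-end time on the benchmarked model would be spent in operations the accelerator covers.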
