Conference: IEEE/ACM International Conference on Computer-Aided Design

SAT-based compilation to a non-vonNeumann processor


Abstract

This paper describes a compilation technique for mapping dataflow computations, common in deep neural network workloads, onto Coarse-Grained Reconfigurable Array (CGRA) architectures. The technique has been demonstrated to automatically compile dataflow programs onto a commercial, massively parallel, CGRA-based dataflow processor (DPU) containing 16,000 processing elements. The DPU architecture overcomes the von Neumann bottleneck by flowing data spatially and reusing it from local memories, and it provides higher computational efficiency than temporally parallel architectures such as GPUs and multi-core CPUs. However, existing software development tools for CGRAs are limited to compiling domain-specific programs onto processing elements with uniform structures, and they are not effective on complex microarchitectures where memory access latency varies in a nontrivial fashion depending on data locality. A primary contribution of this paper is a general algorithm that can compile arbitrary dataflow graphs and can efficiently utilize processing elements with rich microarchitectural features such as complex instructions, multi-precision data paths, local memories, register files, and switches. A second contribution is a novel application of Boolean satisfiability (SAT) to formally solve this complex and irregular optimization problem, producing high-quality results comparable to assembly code hand-written by human experts. A third contribution is an adaptive windowing algorithm that keeps the complexity of the SAT-based approach under control and delivers a scalable and robust solution.
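To make the idea of SAT-based mapping concrete, the following is a minimal sketch and not the authors' formulation. It assumes a toy model in which every operation occupies one (processing element, cycle) slot, every processing element can run any operation, and a consumer only has to start in a later cycle than its producer. All names (`encode`, `solve`, `compile_graph`) are hypothetical, the solver is a tiny hand-written DPLL rather than a production SAT solver, and the loop that grows the cycle budget is plain iterative deepening, not the adaptive windowing described in the abstract.

```python
from itertools import combinations, product


def encode(nodes, edges, num_pes, num_slots):
    """Build CNF clauses (lists of signed ints) for a toy placement-and-scheduling model."""
    slots = list(product(range(num_pes), range(num_slots)))  # (pe, cycle) pairs
    # One Boolean variable per (node, slot): "node runs on this PE in this cycle".
    var = {(n, s): i + 1 for i, (n, s) in enumerate(product(nodes, slots))}
    clauses = []
    for n in nodes:
        clauses.append([var[n, s] for s in slots])        # node is placed somewhere
        for a, b in combinations(slots, 2):               # ... and only once
            clauses.append([-var[n, a], -var[n, b]])
    for s in slots:                                       # a slot hosts at most one node
        for m, n in combinations(nodes, 2):
            clauses.append([-var[m, s], -var[n, s]])
    for u, v in edges:                                    # consumer runs strictly later
        for a, b in product(slots, slots):
            if b[1] <= a[1]:
                clauses.append([-var[u, a], -var[v, b]])
    return var, clauses


def solve(clauses, assignment=None):
    """Minimal DPLL solver: returns a dict var -> bool, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    while True:  # simplify and propagate unit clauses until a fixed point
        simplified, unit = [], None
        for clause in clauses:
            lits, satisfied = [], False
            for lit in clause:
                val = assignment.get(abs(lit))
                if val is None:
                    lits.append(lit)
                elif val == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not lits:
                return None  # every literal is false: conflict
            if len(lits) == 1 and unit is None:
                unit = lits[0]
            simplified.append(lits)
        clauses = simplified
        if unit is None:
            break
        assignment[abs(unit)] = unit > 0
    if not clauses:
        return assignment
    lit = clauses[0][0]  # branch on the first unassigned literal
    for value in (lit > 0, lit < 0):
        result = solve(clauses, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None


def compile_graph(nodes, edges, num_pes, max_slots):
    """Grow the cycle budget until the formula is satisfiable; return node -> (pe, cycle)."""
    for num_slots in range(1, max_slots + 1):
        var, clauses = encode(nodes, edges, num_pes, num_slots)
        model = solve(clauses)
        if model is not None:
            return {n: s for (n, s), v in var.items() if model.get(v)}
    return None


# Diamond-shaped dataflow graph: a feeds b and c, which both feed d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
placement = compile_graph(nodes, edges, num_pes=2, max_slots=4)
print(placement)
```

Because the budget starts at one cycle and grows, the first satisfiable formula yields a shortest schedule under this toy model. A realistic encoding would also need constraints for routing, memory locality, and instruction-specific latencies, which this sketch leaves out.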
