Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Tensor Optimization for High-Level Synthesis Design Flows


Abstract

Improving the data locality of tensor data structures is a crucial optimization for maximizing the performance of machine learning and intensive linear algebra applications. While CPUs and GPUs improve data locality through automated caching mechanisms, FPGAs let the developer specify how data structures are allocated. Although this feature enables a high degree of customizability, the increasing complexity and memory footprint of modern applications make it impractical to find an optimal allocation manually. For this reason, we propose a compiler optimization that automatically improves the tensor allocation of high-level software descriptions. The optimization is controlled by a flexible cost model that can be tuned by means of simple yet expressive callback functions. In this way, the user can tailor the optimization strategy to the optimization goal. We tested our methodology by integrating our optimization into the Bambu open-source HLS framework. In this setting, we achieved a 14% speedup on the digit recognition version proposed by the Rosetta benchmark. Moreover, we tested our optimization on the CHStone benchmark suite, achieving an average speedup of 6%. Finally, we applied our methodology to two industrial examples from the aerospace domain, obtaining a 15% speedup. As a final step, we tested the versatility of our methodology by inserting our optimization into the Clang software optimization flow, achieving a 12% speedup on the Rosetta benchmark when running on a CPU.
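
The abstract does not specify the cost model or the callback interface, so the following is only an illustrative sketch of the general idea of a callback-driven allocation pass, not the authors' implementation. Every name here (Tensor, Memory, allocate, latency_cost) is hypothetical, and the greedy placement is an assumed stand-in for whatever search the paper actually uses. The point it shows is that the allocator stays fixed while the user swaps the cost callback to change the optimization goal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tensor:
    """Hypothetical tensor descriptor (not from the paper)."""
    name: str
    size_bytes: int
    accesses: int  # estimated number of reads and writes


@dataclass(frozen=True)
class Memory:
    """Hypothetical memory resource, e.g. on-chip BRAM or external DRAM."""
    name: str
    capacity_bytes: int
    latency_cycles: int


# Assumed callback shape: cost of placing a tensor in a memory (lower is better).
CostFn = Callable[[Tensor, Memory], float]


def latency_cost(tensor: Tensor, memory: Memory) -> float:
    """Example callback: total access latency for this placement."""
    return tensor.accesses * memory.latency_cycles


def allocate(tensors: List[Tensor], memories: List[Memory], cost: CostFn) -> Dict[str, str]:
    """Greedily place each tensor in the cheapest memory that still has room."""
    free = {m.name: m.capacity_bytes for m in memories}
    placement: Dict[str, str] = {}
    # Handle the most frequently accessed tensors first.
    for t in sorted(tensors, key=lambda t: t.accesses, reverse=True):
        candidates = [m for m in memories if free[m.name] >= t.size_bytes]
        if not candidates:
            raise ValueError(f"no memory can hold tensor {t.name}")
        best = min(candidates, key=lambda m: cost(t, m))
        placement[t.name] = best.name
        free[best.name] -= t.size_bytes
    return placement
```

Calling allocate(tensors, memories, latency_cost) favors low-latency memories for hot tensors; passing a different callback, for example one that penalizes on-chip memory usage, changes the resulting allocation without touching the allocator itself.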
