IEEE International Parallel and Distributed Processing Symposium

Benanza: Automatic μBenchmark Generation to Compute 'Lower-bound' Latency and Inform Optimizations of Deep Learning Models on GPUs

Abstract

As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important avenue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, current profiling tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced. We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the "lower-bound" latency of DL models using the benchmark data and informs optimizations of model execution. The "lower-bound" latency metric estimates the ideal model execution on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.
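To make the "lower-bound" metric concrete, the sketch below (a hypothetical illustration, not Benanza's actual implementation) shows one way such a latency could be derived from per-layer micro-benchmark results: the ideal sequential latency is the sum of the benchmarked layer latencies, while the ideal parallel latency is the critical path through the model's data-dependency graph. The layer names, latency table, and toy graph are invented stand-ins for Benanza's benchmark database and parsed model representation.

```python
# Hypothetical sketch of a "lower-bound" latency computation from
# per-layer micro-benchmark data (not Benanza's actual code).

# Assumed per-layer latencies in milliseconds, as a micro-benchmark
# database might report them for one GPU and batch size.
bench_latency = {
    "conv1": 0.42, "relu1": 0.03,
    "conv2a": 0.40, "conv2b": 0.38,  # two independent branches
    "add": 0.02, "fc": 0.11,
}

# Data-dependency edges of a toy model graph: node -> consumers.
edges = {
    "conv1": ["relu1"],
    "relu1": ["conv2a", "conv2b"],
    "conv2a": ["add"],
    "conv2b": ["add"],
    "add": ["fc"],
    "fc": [],
}

def sequential_lower_bound(latency):
    """Ideal latency if every layer runs back-to-back on one stream."""
    return sum(latency.values())

def parallel_lower_bound(edges, latency, root="conv1"):
    """Ideal latency if independent branches overlap perfectly:
    the longest (critical) path through the dependency DAG."""
    memo = {}
    def longest(node):
        if node not in memo:
            tail = max((longest(c) for c in edges[node]), default=0.0)
            memo[node] = latency[node] + tail
        return memo[node]
    return longest(root)

print(f"sequential lower bound: {sequential_lower_bound(bench_latency):.2f} ms")
print(f"parallel lower bound:   {parallel_lower_bound(edges, bench_latency):.2f} ms")
```

In this toy graph the two convolution branches overlap in the parallel estimate, so the parallel lower bound (0.98 ms) falls below the sequential one (1.36 ms); the gap between a framework's measured latency and such estimates is what quantifies opportunities like the parallel layer execution and layer fusion optimizations mentioned in the abstract.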
