IEEE International Parallel and Distributed Processing Symposium

Benanza: Automatic μBenchmark Generation to Compute 'Lower-bound' Latency and Inform Optimizations of Deep Learning Models on GPUs

Abstract

As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important avenue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, current profiling tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced. We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the "lower-bound" latency of DL models using the benchmark data and informs optimizations of model execution. The "lower-bound" latency metric estimates the ideal model execution on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.
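To make the "lower-bound" metric concrete, the sketch below (a hypothetical illustration, not Benanza's actual implementation) shows one way such a latency could be derived from per-layer micro-benchmark results: the ideal sequential latency is the sum of the benchmarked layer latencies, while the ideal parallel latency is the critical path through the model's data-dependency graph. The layer names, latency table, and toy graph are invented stand-ins for Benanza's benchmark database and parsed model representation.

```python
# Hypothetical sketch of a "lower-bound" latency computation from
# per-layer micro-benchmark data (not Benanza's actual code).

# Assumed per-layer latencies in milliseconds, as a micro-benchmark
# database might report them for one GPU and batch size.
bench_latency = {
    "conv1": 0.42, "relu1": 0.03,
    "conv2a": 0.40, "conv2b": 0.38,  # two independent branches
    "add": 0.02, "fc": 0.11,
}

# Data-dependency edges of a toy model graph: node -> consumers.
edges = {
    "conv1": ["relu1"],
    "relu1": ["conv2a", "conv2b"],
    "conv2a": ["add"],
    "conv2b": ["add"],
    "add": ["fc"],
    "fc": [],
}

def sequential_lower_bound(latency):
    """Ideal latency if every layer runs back-to-back on one stream."""
    return sum(latency.values())

def parallel_lower_bound(edges, latency, root="conv1"):
    """Ideal latency if independent branches overlap perfectly:
    the longest (critical) path through the dependency DAG."""
    memo = {}
    def longest(node):
        if node not in memo:
            tail = max((longest(c) for c in edges[node]), default=0.0)
            memo[node] = latency[node] + tail
        return memo[node]
    return longest(root)

print(f"sequential lower bound: {sequential_lower_bound(bench_latency):.2f} ms")
print(f"parallel lower bound:   {parallel_lower_bound(edges, bench_latency):.2f} ms")
```

In this toy graph the two convolution branches overlap in the parallel estimate, so the parallel lower bound (0.98 ms) falls below the sequential one (1.36 ms); the gap between a framework's measured latency and such estimates is what quantifies opportunities like the parallel layer execution and layer fusion optimizations mentioned in the abstract.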
