Energy-efficient Inference Service of Transformer-based Deep Learning Models on GPUs

Abstract

Inference-as-a-service (IAAS) has recently been launched by cloud service providers to support on-demand AI applications. Many natural language processing (NLP) services are based on the Transformer sequence transduction model. However, the inference process of the Transformer model consumes a significant amount of energy due to the large model size (e.g., billions of parameters) and the enormous amount of computation involved. How to reduce the energy consumption of IAAS without violating the service-level agreement (SLA) has become a practical challenge for service providers. In this work, we conduct a comprehensive study of the inference performance and energy efficiency of a Transformer model trained for a language translation service. First, we empirically characterize essential performance metrics, including latency, throughput, and energy consumption, on three different GPUs under diverse workload configurations. This detailed workload separation enables a thorough understanding of the Transformer inference process. Second, we provide an energy consumption model for the Transformer based on the observed data. Finally, we propose the Aligned scheduling scheme, which improves throughput by up to 2.86× and energy efficiency by up to 2.73×, at the cost of a 40% average latency increase. Our findings provide a full-scope view of Transformer inference and suggest that workload balancing and scheduling have great potential to deliver energy-efficient Transformer inference services.
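The abstract describes characterizing latency, throughput, and energy on GPUs under different workload configurations, but the measurement code itself is not given here. Below is a minimal, hypothetical Python sketch of one way such per-workload measurements could be collected, assuming PyTorch for inference and NVIDIA's pynvml for board power readings; the `measure_inference` helper, the fixed GPU index, and the once-per-iteration power sampling are illustrative assumptions, not the authors' implementation.

```python
import time
import torch
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # hypothetical single-GPU setup

def measure_inference(model, batch, n_iters=50):
    """Return (avg latency [s], throughput [seq/s], energy [J]) for one workload."""
    model.eval()
    torch.cuda.synchronize()
    power_samples = []                      # instantaneous board power in watts
    start = time.time()
    with torch.no_grad():
        for _ in range(n_iters):
            model(batch)                    # one forward (inference) pass
            torch.cuda.synchronize()
            # nvmlDeviceGetPowerUsage reports milliwatts; convert to watts
            power_samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
    elapsed = time.time() - start
    latency = elapsed / n_iters                                   # seconds per batch
    throughput = batch.size(0) / latency                          # sequences per second
    energy = (sum(power_samples) / len(power_samples)) * elapsed  # avg W * s = J
    return latency, throughput, energy
```

A more rigorous setup would likely sample power on a separate thread at a fixed interval and sweep batch size and sequence length to realize the diverse workload configurations mentioned above; this sketch only illustrates the shape of the measurement loop.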