
Patient Knowledge Distillation for BERT Model Compression



Abstract

Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: (i) PKD-Last: learning from the last k layers; and (ii) PKD-Skip: learning from every k layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with significant gain in training efficiency, without sacrificing model accuracy.
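The two layer-selection strategies and the "patient" multi-layer loss lend themselves to a compact sketch. The following is a minimal PyTorch-style illustration under stated assumptions, not the authors' released implementation: the helper names (`select_teacher_layers`, `pkd_loss`) and the hyperparameter values are hypothetical, and the loss follows the paper's general recipe of combining a hard-label cross-entropy term, a temperature-scaled soft-label term, and a mean-squared-error term over normalized [CLS] hidden states of the matched layers.

```python
# Minimal sketch of Patient Knowledge Distillation (PKD); helper names and
# hyperparameter values are illustrative, not the authors' released code.
import torch.nn.functional as F


def select_teacher_layers(num_teacher_layers: int, num_student_layers: int,
                          strategy: str) -> list:
    """Choose which teacher layers the student's intermediate layers imitate.

    The last teacher layer is left to the soft-label loss, so the patient
    loss matches (num_student_layers - 1) teacher layers. Layers are 0-indexed.
    """
    k = num_student_layers - 1
    if strategy == "last":           # PKD-Last: the k layers just below the top
        return list(range(num_teacher_layers - 1 - k, num_teacher_layers - 1))
    if strategy == "skip":           # PKD-Skip: every (teacher // student)-th layer
        step = num_teacher_layers // num_student_layers
        return list(range(step - 1, num_teacher_layers - 1, step))[:k]
    raise ValueError(f"unknown strategy: {strategy}")


def pkd_loss(student_logits, teacher_logits, labels,
             student_cls, teacher_cls,
             temperature=5.0, alpha=0.5, beta=100.0):
    """Hard-label CE + temperature-scaled soft-label KD + patient hidden-state term.

    student_cls / teacher_cls: lists of [CLS] hidden states, shape (batch, hidden),
    one entry per matched (student layer, teacher layer) pair.
    """
    # Cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label distillation from the teacher's final-layer logits.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Patient loss: MSE between L2-normalized [CLS] vectors of matched layers.
    pt = sum(
        F.mse_loss(F.normalize(s, dim=-1), F.normalize(t, dim=-1))
        for s, t in zip(student_cls, teacher_cls)
    ) / len(student_cls)

    return (1 - alpha) * ce + alpha * kd + beta * pt


# Example: compressing a 12-layer teacher into a 6-layer student.
print(select_teacher_layers(12, 6, "skip"))  # [1, 3, 5, 7, 9]  -> layers 2,4,6,8,10 (1-indexed)
print(select_teacher_layers(12, 6, "last"))  # [6, 7, 8, 9, 10] -> layers 7..11 (1-indexed)
```

The final teacher layer is excluded from the patient term because its information already reaches the student through the soft-label distillation loss; for a 12-layer teacher and 6-layer student, the selections printed above should correspond to the PKD-Skip and PKD-Last configurations described for compressing BERT-Base.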
