
Patient Knowledge Distillation for BERT Model Compression



Abstract

Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: (i) PKD-Last: learning from the last k layers; and (ii) PKD-Skip: learning from every k layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with significant gain in training efficiency, without sacrificing model accuracy.
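The two layer-selection strategies and the "patient" multi-layer loss lend themselves to a compact sketch. The following is a minimal PyTorch-style illustration under stated assumptions, not the authors' released implementation: the helper names (`select_teacher_layers`, `pkd_loss`) and the hyperparameter values are hypothetical, and the loss follows the paper's general recipe of combining a hard-label cross-entropy term, a temperature-scaled soft-label term, and a mean-squared-error term over normalized [CLS] hidden states of the matched layers.

```python
# Minimal sketch of Patient Knowledge Distillation (PKD); helper names and
# hyperparameter values are illustrative, not the authors' released code.
import torch.nn.functional as F


def select_teacher_layers(num_teacher_layers: int, num_student_layers: int,
                          strategy: str) -> list:
    """Choose which teacher layers the student's intermediate layers imitate.

    The last teacher layer is left to the soft-label loss, so the patient
    loss matches (num_student_layers - 1) teacher layers. Layers are 0-indexed.
    """
    k = num_student_layers - 1
    if strategy == "last":           # PKD-Last: the k layers just below the top
        return list(range(num_teacher_layers - 1 - k, num_teacher_layers - 1))
    if strategy == "skip":           # PKD-Skip: every (teacher // student)-th layer
        step = num_teacher_layers // num_student_layers
        return list(range(step - 1, num_teacher_layers - 1, step))[:k]
    raise ValueError(f"unknown strategy: {strategy}")


def pkd_loss(student_logits, teacher_logits, labels,
             student_cls, teacher_cls,
             temperature=5.0, alpha=0.5, beta=100.0):
    """Hard-label CE + temperature-scaled soft-label KD + patient hidden-state term.

    student_cls / teacher_cls: lists of [CLS] hidden states, shape (batch, hidden),
    one entry per matched (student layer, teacher layer) pair.
    """
    # Cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label distillation from the teacher's final-layer logits.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Patient loss: MSE between L2-normalized [CLS] vectors of matched layers.
    pt = sum(
        F.mse_loss(F.normalize(s, dim=-1), F.normalize(t, dim=-1))
        for s, t in zip(student_cls, teacher_cls)
    ) / len(student_cls)

    return (1 - alpha) * ce + alpha * kd + beta * pt


# Example: compressing a 12-layer teacher into a 6-layer student.
print(select_teacher_layers(12, 6, "skip"))  # [1, 3, 5, 7, 9]  -> layers 2,4,6,8,10 (1-indexed)
print(select_teacher_layers(12, 6, "last"))  # [6, 7, 8, 9, 10] -> layers 7..11 (1-indexed)
```

The final teacher layer is excluded from the patient term because its information already reaches the student through the soft-label distillation loss; for a 12-layer teacher and 6-layer student, the selections printed above should correspond to the PKD-Skip and PKD-Last configurations described for compressing BERT-Base.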
