Annual Meeting of the Association for Computational Linguistics

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned



Abstract

Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L_0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
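The pruning mechanism the abstract refers to can be pictured as follows: each encoder head gets a scalar gate that multiplies the head's output, the gates are sampled from a Hard Concrete distribution (a stretched and clipped concrete/Gumbel-sigmoid, after Louizos et al., 2018), and the expected number of non-zero gates acts as a differentiable surrogate for the L_0 penalty. The snippet below is a minimal PyTorch sketch under that reading; the class name HeadGate, the hyperparameter defaults, and the way it would be wired into training are assumptions, not the authors' released code.

```python
import math

import torch
import torch.nn as nn


class HeadGate(nn.Module):
    """Hard Concrete gates over attention heads: a stochastic gate per head
    with a differentiable relaxation of the L_0 penalty.
    Hypothetical sketch, not the paper's implementation."""

    def __init__(self, n_heads: int, beta: float = 2.0 / 3.0,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_heads))  # per-head gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:
            # reparameterized sample from the Hard Concrete distribution
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # stretch to (gamma, zeta) and clip to [0, 1], so gates can hit exactly 0 or 1
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self) -> torch.Tensor:
        # expected number of open gates: the differentiable surrogate for L_0
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
```

In use, each head's output would be multiplied by its gate before the output projection of the attention layer, and a term such as lambda * gate.expected_l0() would be added to the translation loss; heads whose gates collapse to zero contribute nothing and can be removed entirely at inference, which is how the abstract's 38-of-48 pruning result is obtained without retraining the remaining heads from scratch.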
