Annual Meeting of the Association for Computational Linguistics

A Mixture of h - 1 Heads is Better than h Heads

Abstract

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them: the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. In particular, on the WMT14 English-German translation dataset, MAE improves over "transformer-base" by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.
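The abstract outlines the key ingredients of MAE: several attentive experts derived from a single h-head attention layer, an input-dependent gate that assigns responsibilities over them, and a block coordinate descent procedure that alternates between updating responsibilities and expert parameters. The following is a minimal PyTorch sketch of that idea only; it assumes a leave-one-head-out expert construction (expert i is the same layer with head i disabled, so h-1 heads are active) and mean-pooled gating, and all module and variable names are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfAttentiveExperts(nn.Module):
    """Sketch: expert i is the shared h-head attention with head i masked out."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Gate: maps a pooled summary of the input to h expert logits
        # ("responsibilities" after the softmax).
        self.gate = nn.Linear(d_model, n_heads)

    def _attend(self, x, head_mask):
        # Standard scaled dot-product self-attention over h heads;
        # head_mask (shape [h]) zeroes out the dropped head's output.
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            return t.view(B, T, self.h, self.d_head).transpose(1, 2)  # [B, h, T, d_head]

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        ctx = F.softmax(scores, dim=-1) @ v               # [B, h, T, d_head]
        ctx = ctx * head_mask.view(1, self.h, 1, 1)       # disable one head
        return self.out(ctx.transpose(1, 2).reshape(B, T, -1))

    def forward(self, x):                                 # x: [B, T, d_model]
        # Input-dependent responsibilities over the h experts.
        g = F.softmax(self.gate(x.mean(dim=1)), dim=-1)   # [B, h]
        expert_outputs = []
        for i in range(self.h):
            head_mask = torch.ones(self.h, device=x.device)
            head_mask[i] = 0.0                            # expert i drops head i
            expert_outputs.append(self._attend(x, head_mask))
        experts = torch.stack(expert_outputs, dim=1)      # [B, h, T, d_model]
        # Responsibility-weighted combination of the h experts.
        return (g.view(-1, self.h, 1, 1) * experts).sum(dim=1)


# Example usage (hypothetical sizes)
x = torch.randn(2, 5, 64)
layer = MixtureOfAttentiveExperts(d_model=64, n_heads=8)
y = layer(x)                                              # [2, 5, 64]
```

In a sketch like this, the block coordinate descent described in the abstract would correspond to alternating optimizer steps that update only the gate (the responsibilities) and only the attention/projection weights (the expert parameters), rather than updating everything jointly.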
