Conference on Empirical Methods in Natural Language Processing

Improving Multilingual Models with Language-Clustered Vocabularies



Abstract

State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model will expect to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on the key multilingual benchmark tasks TYDI QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1), and a factor-of-8 reduction in out-of-vocabulary rate, all without increasing the size of the model or data.
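
The idea of combining separately trained per-cluster vocabularies can be illustrated with a small sketch. The abstract does not specify how the language clusters are derived or which subword algorithm is used, so everything below is an assumption made for illustration: the clusters are assigned by hand, `train_vocab` is a hypothetical stand-in that keeps the most frequent whitespace tokens rather than training real subwords, and the corpora are invented toy data. It is not the authors' implementation.

```python
from collections import Counter


def train_vocab(sentences, size):
    # Hypothetical stand-in for a subword trainer: keep the `size` most
    # frequent whitespace tokens. A real system would train subwords.
    counts = Counter(tok for s in sentences for tok in s.split())
    # Sort by (-count, token) so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok for tok, _ in ranked[:size]}


def clustered_vocab(corpora, clusters, size_per_cluster):
    # corpora: language code -> list of sentences.
    # clusters: list of lists of language codes (assumed given here;
    # the abstract says they are derived automatically).
    vocab = set()
    for cluster in clusters:
        pooled = [s for lang in cluster for s in corpora[lang]]
        # Train one vocabulary per cluster, then take the union.
        vocab |= train_vocab(pooled, size_per_cluster)
    return vocab


def oov_rate(sentences, vocab):
    # Fraction of tokens that are missing from the vocabulary.
    tokens = [tok for s in sentences for tok in s.split()]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Invented toy corpora, for illustration only.
corpora = {
    "en": ["the cat sat", "the dog sat"],
    "de": ["die katze sass", "die katze lief"],
    "hi": ["billi baithi", "billi soyi"],
}
clusters = [["en", "de"], ["hi"]]

# Clustered vocabulary: 3 entries per cluster, at most 6 in total.
vocab = clustered_vocab(corpora, clusters, size_per_cluster=3)

# Baseline: one joint vocabulary of the same total size over all languages.
all_sentences = [s for sents in corpora.values() for s in sents]
joint = train_vocab(all_sentences, size=6)

hi_oov_clustered = oov_rate(corpora["hi"], vocab)
hi_oov_joint = oov_rate(corpora["hi"], joint)
```

In this toy setup the smaller language gets its own share of the vocabulary budget under clustering, whereas in the joint vocabulary it competes with the higher-volume languages. This is meant only to show the shape of the trade-off the abstract describes, not to reproduce its reported numbers.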
