Linguistically-augmented perplexity-based data selection for language models

Abstract

This paper explores the use of linguistic information for selecting the data used to train language models. We take the state-of-the-art method in perplexity-based data selection as our starting point and extend it to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge with the surface forms: (1) naive selection of the top-ranked sentences selected by each method; (2) linear interpolation of the datasets selected by the different methods. The paper presents detailed results and analysis for four languages with different levels of morphological complexity (English, Spanish, Czech and Chinese). The interpolation-based combination outperforms the purely statistical baseline in all scenarios, resulting in language models with lower perplexity. In relative terms the improvements are similar regardless of the language, with perplexity reductions in the range 7.72-13.02%. In absolute terms the reduction is higher for languages with a high type-token ratio (Chinese, 202.16) or rich morphology (Czech, 81.53) and lower for the remaining languages, Spanish (55.2) and English (34.43 on the English side of the parallel dataset shared with Czech, and 61.90 on the English side of the parallel dataset shared with Spanish).
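
The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the scoring function, so the code assumes the cross-entropy difference between an in-domain model and a general-domain model, a common form of perplexity-based selection, and it uses add-one-smoothed unigram models purely for brevity. All function names and the toy data are hypothetical and are not taken from the paper. The same ranking function can be run on any token representation (surface forms, lemmas, part-of-speech tags), and the last function sketches combination method 1 only.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens over a list of tokenised sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model.

    Assumes the sentence is non-empty.
    """
    counts, total = model
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return -log_prob / len(sentence)


def rank_by_entropy_difference(pool, in_domain, general):
    """Return pool indices, most in-domain-like first (lowest score first)."""
    vocab = {tok for corpus in (pool, in_domain, general)
             for sent in corpus for tok in sent}
    dom, gen = train_unigram(in_domain), train_unigram(general)
    scores = [
        cross_entropy(s, dom, len(vocab)) - cross_entropy(s, gen, len(vocab))
        for s in pool
    ]
    return sorted(range(len(pool)), key=lambda i: scores[i])


def naive_combination(rankings, k):
    """Assumed form of method 1: union of the top-k indices of each ranking."""
    selected = []
    for ranking in rankings:
        for idx in ranking[:k]:
            if idx not in selected:
                selected.append(idx)
    return selected


# Toy data (hypothetical): each sentence is a list of tokens.
in_domain = [["patient", "takes", "medicine"], ["doctor", "treats", "patient"]]
general = [["team", "wins", "match"], ["player", "scores", "goal"]]
pool = [["player", "wins", "goal"], ["doctor", "takes", "medicine"]]

ranking = rank_by_entropy_difference(pool, in_domain, general)
print(ranking)  # [1, 0]: the medical sentence ranks first
```

To combine representations under these assumptions, one would compute one ranking per representation (for example, one over surface forms and one over lemmas) and pass the list of rankings to `naive_combination`.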

Publication details

  • Source
    Computer Speech and Language | 2015, Issue 1 | pp. 11-26 | 16 pages
  • Author affiliations

    School of Computing, Dublin City University, Dublin, Ireland;

    Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic;

    Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau S.A.R., China;

    DFKI GmbH, Multilingual Technologies, Campus D3 2, D-66123 Saarbruecken, Germany;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: English
  • Keywords

    Data selection; Language modelling; Computational linguistics;
