
Using Related Languages to Enhance Statistical Language Models

Abstract

The success of many language modeling methods and applications relies heavily on the amount of data available. This dependence is even stronger in statistical machine translation, where parallel data in the source and target languages is required. However, large amounts of data are available for only a small number of languages; as a result, many language modeling techniques are inadequate for the vast majority of languages. In this paper, we attempt to mitigate the lack of training data for low-resource languages by adding data from related high-resource languages in three experiments. First, we interpolate language models trained on the target language and on the related language. In our second experiment, we select the related-language sentences most similar to the target language and add them to our training corpus. Finally, we integrate data from the related language into a translation model for a statistical machine translation application. Although we do not see many significant improvements over baselines trained on a small amount of data in the target language, we discuss some further experiments that could be attempted in order to augment language models and translation models with data from related languages.
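The first two experiments can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes add-one smoothed unigram models over a shared vocabulary (the abstract does not state the model order, smoothing method, or toolkit), a fixed interpolation weight `lam` that would normally be tuned on held-out target-language data, and per-token cross-entropy under the target model as the similarity measure for sentence selection. All function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter

UNK = "<unk>"  # placeholder for out-of-vocabulary tokens (an assumption of this sketch)

def tokens_of(sentences, vocab):
    """Whitespace-tokenize and map tokens outside the vocabulary to UNK."""
    return [tok if tok in vocab else UNK for s in sentences for tok in s.split()]

def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens_of(sentences, vocab))
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def interpolate(p_target, p_related, lam):
    """Linear interpolation: lam * P_target(w) + (1 - lam) * P_related(w)."""
    return {w: lam * p_target[w] + (1 - lam) * p_related[w] for w in p_target}

def perplexity(model, sentences, vocab):
    """Per-token perplexity of the model on the given sentences."""
    toks = tokens_of(sentences, vocab)
    log_prob = sum(math.log(model[t]) for t in toks)
    return math.exp(-log_prob / len(toks))

def select_similar(related_sentences, p_target, vocab, k):
    """Keep the k related-language sentences with the lowest per-token
    cross-entropy under the target-language model (non-empty sentences assumed)."""
    def cross_entropy(sentence):
        toks = tokens_of([sentence], vocab)
        return -sum(math.log(p_target[t]) for t in toks) / len(toks)
    return sorted(related_sentences, key=cross_entropy)[:k]

# Toy data invented for illustration only; letters stand in for words.
target = ["a b c", "a b d"]            # small target-language corpus
related = ["a b e", "x y z", "a c d"]  # larger related-language corpus
heldout = ["a b c d"]                  # held-out target-language text

vocab = {UNK} | {tok for s in target + related for tok in s.split()}
p_target = train_unigram(target, vocab)
p_related = train_unigram(related, vocab)
p_mix = interpolate(p_target, p_related, lam=0.7)

print(perplexity(p_target, heldout, vocab), perplexity(p_mix, heldout, vocab))
print(select_similar(related, p_target, vocab, k=2))
```

In a realistic setting the unigram models would be replaced by higher-order n-gram models from a language modeling toolkit, and `lam` would be chosen to minimize held-out perplexity rather than fixed by hand.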