Journal: ACM Transactions on Asian Language Information Processing

An Automatic and a Machine-assisted Method to Clean Bilingual Corpus


Abstract

This article presents two methods for cleaning parallel corpora. The first is a machine-assisted technique, well suited to small parallel corpora; the second is an automatic method, suited to large parallel corpora. Both are evaluated with a baseline SMT (MOSES) system. The machine-assisted technique uses two features: word alignment and the lengths of the source- and target-language sentences. These features are used to detect mistranslations in the corpus, which are then handled by a human translator. Experiments with this method are conducted on the English-to-Indian Language Machine Translation (EILMT) corpus (English-Hindi). The Bilingual Evaluation Understudy (BLEU) score improves by 0.47% on the cleaned corpus. The automatic method combines two features: the lengths of the source- and target-language sentences, and the Viterbi alignment score generated by a Hidden Markov Model for each sentence pair. A separate threshold is used for each feature; both thresholds are chosen using a small, manually annotated parallel corpus of 206 sentence pairs. Experiments with this method are conducted on the HindEnCorp corpus, released at the workshop of the Association for Computational Linguistics (ACL 2014). The BLEU score improves by 0.6% on the cleaned corpus. A comparison of the two methods on the EILMT corpus is also presented.
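The two-threshold filter described for the automatic method can be illustrated with a minimal Python sketch. The abstract does not say how the length feature is computed or how the two thresholds are combined, so everything here is an assumption: the length feature is taken to be the ratio of the shorter to the longer sentence in whitespace tokens, a pair is kept only if it passes both thresholds, and the alignment score is treated as a precomputed number supplied by an external HMM aligner. The names `SentencePair`, `length_ratio`, `keep_pair`, and `clean_corpus`, as well as the threshold values and scores, are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SentencePair:
    source: str
    target: str
    # Assumed to be precomputed by an external HMM word aligner,
    # e.g. a length-normalised Viterbi log-probability (higher is better).
    alignment_score: float


def length_ratio(pair: SentencePair) -> float:
    """Shorter length divided by longer length, in whitespace tokens (0.0 to 1.0)."""
    src_len = len(pair.source.split())
    tgt_len = len(pair.target.split())
    if src_len == 0 or tgt_len == 0:
        return 0.0  # an empty side can never be a valid translation
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def keep_pair(pair: SentencePair,
              min_length_ratio: float,
              min_alignment_score: float) -> bool:
    """Assumed combination rule: a pair must pass both thresholds to be kept."""
    return (length_ratio(pair) >= min_length_ratio
            and pair.alignment_score >= min_alignment_score)


def clean_corpus(pairs: Iterable[SentencePair],
                 min_length_ratio: float,
                 min_alignment_score: float) -> List[SentencePair]:
    """Return only the sentence pairs that pass both feature thresholds."""
    return [p for p in pairs
            if keep_pair(p, min_length_ratio, min_alignment_score)]


# Illustrative data only; the scores are made up for the example.
pairs = [
    SentencePair("the house is small", "घर छोटा है", -1.2),
    SentencePair("good morning", "यह एक बहुत लंबा और असंबंधित वाक्य है", -1.0),
    SentencePair("he reads a book", "वह किताब पढ़ता है", -9.5),
]

# Hypothetical thresholds; the article tunes its own on 206 annotated pairs.
cleaned = clean_corpus(pairs, min_length_ratio=0.5, min_alignment_score=-5.0)
for pair in cleaned:
    print(pair.source, "|||", pair.target)
```

In this sketch the second pair is dropped by the length feature and the third by the alignment score, so each feature catches a different kind of noise. In practice the two thresholds would be chosen by checking candidate values against a small manually annotated set, as the abstract describes.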
