
Semi-supervised Learning for Mongolian Morphological Segmentation


Abstract

Previous Mongolian morphological segmentation methods rely on large labeled training sets or on complicated rules written by linguists. We instead explore a novel semi-supervised method aimed at a practical application, statistical machine translation (SMT), in a low-resource setting where only a small amount of labeled data and a large amount of unlabeled data are available. First, a supervised model based on conditional random fields (CRF) is trained on the small labeled set to predict morpheme boundaries. Then, a lexicon-based segmentation model, which uses the small labeled set as heuristic information, exploits the abundant unlabeled data to compensate for the weaknesses of the first step. Finally, we present several error correction models to revise the segmentation results. Experimental results show that our method improves segmentation over purely supervised learning. In addition, we integrate the segmentation results into Chinese-Mongolian SMT and achieve satisfactory performance compared with the baseline.
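The first step, predicting morpheme boundaries with a CRF, is commonly set up as per-character sequence labeling. The sketch below shows only that data preparation, and it rests on assumptions: the abstract does not state the label scheme or the feature set, so the B/I labels, the feature names, and the helper functions `morphemes_to_labels`, `labels_to_morphemes`, and `char_features` are all hypothetical. The romanized example word and its split are for illustration only.

```python
from typing import Dict, List, Tuple


def morphemes_to_labels(morphemes: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a segmented word into characters and per-character labels.

    Assumed scheme: "B" marks the first character of a morpheme,
    "I" marks every other character.
    """
    chars: List[str] = []
    labels: List[str] = []
    for morpheme in morphemes:
        for i, ch in enumerate(morpheme):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def labels_to_morphemes(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild the morpheme list from characters and predicted labels."""
    morphemes: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not morphemes:
            morphemes.append(ch)   # start a new morpheme
        else:
            morphemes[-1] += ch    # extend the current morpheme
    return morphemes


def char_features(chars: List[str], i: int) -> Dict[str, object]:
    """Hypothetical feature dictionary for position i (not the paper's features)."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<s>",
        "next": chars[i + 1] if i < len(chars) - 1 else "</s>",
        "pos_from_end": len(chars) - 1 - i,  # suffixes attach at the word end
    }


# Illustrative labeled example: a stem followed by one suffix.
chars, labels = morphemes_to_labels(["nom", "uud"])
features = [char_features(chars, i) for i in range(len(chars))]
```

Feature dictionaries and label sequences in this shape could be passed to a CRF toolkit for training; the small labeled set mentioned in the abstract would supply the `morphemes` inputs.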
