Journal: Journal of Chinese Information Processing (中文信息学报)

A Chinese Word Segmentation Method Suited to Machine Translation

     

Abstract

Chinese word segmentation is the first step in building a statistical machine translation (SMT) system from Chinese into other languages. However, a conventional segmenter trained on a monolingual corpus is not necessarily well suited to SMT [1]. This paper proposes a segmentation method tailored to SMT that combines monolingual and bilingual knowledge. First, using the notion of alignment confidence, a set of reliable alignments is extracted from a bilingual corpus aligned at the Chinese character level, and the Chinese side of the corpus is re-segmented according to this set. Next, the re-segmented output is merged with the output of a monolingual segmentation tool, and the merged result serves as training data for a Conditional Random Fields (CRF) model, yielding a segmenter that incorporates both kinds of knowledge. In the experiments, this segmenter is applied to the Chinese portions of the training, development, and test sets, and a phrase-based SMT system is built on the result. The results show improved translation quality over the baselines.
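The re-segmentation step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define how alignment confidence is computed or how the two segmentations are merged, so the sketch assumes each Chinese character carries at most one link to a target-word index with a confidence score, and uses a hypothetical threshold `min_conf`. The function names `resegment` and `to_bmes` are illustrative, and the merge with a monolingual segmenter is omitted. Adjacent characters with trusted links to the same target word are grouped into one word, and the words are then converted to B/M/E/S character tags, a common label scheme for training a CRF segmenter.

```python
from typing import Dict, List, Optional, Tuple


def resegment(chars: List[str],
              align: Dict[int, Tuple[int, float]],
              min_conf: float = 0.5) -> List[str]:
    """Group adjacent characters whose trusted links share a target word.

    align maps a character position to (target word index, confidence).
    min_conf is a hypothetical threshold, not a value from the paper.
    """
    words: List[str] = []
    prev_target: Optional[int] = None
    for i, ch in enumerate(chars):
        target, conf = align.get(i, (None, 0.0))
        if conf < min_conf:
            target = None  # untrusted or missing link: treat as unaligned
        if words and target is not None and target == prev_target:
            words[-1] += ch  # same target word as the previous character
        else:
            words.append(ch)  # start a new word
        prev_target = target
    return words


def to_bmes(words: List[str]) -> List[Tuple[str, str]]:
    """Convert a word sequence into (character, tag) pairs for CRF training."""
    tagged: List[Tuple[str, str]] = []
    for w in words:
        if len(w) == 1:
            tagged.append((w, "S"))
        else:
            tagged.append((w[0], "B"))
            tagged.extend((c, "M") for c in w[1:-1])
            tagged.append((w[-1], "E"))
    return tagged


if __name__ == "__main__":
    chars = list("机器翻译系统")
    # Hypothetical character-to-target-word links with confidence scores.
    align = {0: (0, 0.9), 1: (0, 0.8), 2: (1, 0.9),
             3: (1, 0.7), 4: (2, 0.9), 5: (2, 0.3)}
    words = resegment(chars, align)
    print(words)
    print(to_bmes(words))
```

In this toy input the last character has a low-confidence link, so it is left as a single-character word rather than attached to its neighbor. In the method described above, such gaps are where the monolingual segmenter's output would contribute.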
