Chinese word segmentation is the first step in building statistical machine translation (SMT) systems from Chinese into other languages. However, a segmenter trained on a monolingual corpus is not necessarily well suited to SMT [1], so an MT-motivated Chinese word segmenter is needed to improve translation quality. In this paper, we propose a segmentation method adapted to SMT that draws on two kinds of knowledge: one from Chinese-character-based bilingual alignment, the other from conventional monolingual Chinese word segmentation. We first use the notion of alignment confidence to extract a set of reliable alignments from a character-aligned bilingual corpus, and re-segment the Chinese side of the bilingual corpus according to this set. We then merge the re-segmented output with the output of a monolingual segmenter and use the merged result as training data for a Conditional Random Fields (CRF) model, yielding a segmenter that combines monolingual and bilingual knowledge. In the experiments, we segment the Chinese portions of the training, development, and test sets with the proposed segmenter and build a phrase-based SMT system. The results show that the proposed method improves translation quality over the baselines.
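As an illustration of how a segmented corpus can serve as CRF training data, the sketch below converts a segmented sentence into per-character labels and back. The B/M/E/S tag set (begin, middle, end, single) is a common convention for character-based segmenters and is an assumption here; the abstract does not state which tag set or features were used, and the function names `words_to_tags` and `tags_to_words` are hypothetical.

```python
from typing import List, Tuple


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Map a segmented sentence to (character, tag) pairs.

    Assumed tag set (not confirmed by the paper):
    B = word begin, M = word middle, E = word end, S = single-character word.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            for ch in word[1:-1]:
                pairs.append((ch, "M"))
            pairs.append((word[-1], "E"))
    return pairs


def tags_to_words(pairs: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (character, tag) pairs, closing a word at E or S."""
    words: List[str] = []
    current = ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    merged = ["我", "喜欢", "机器翻译"]  # toy merged segmentation
    for ch, tag in words_to_tags(merged):
        print(ch, tag)
```

In a pipeline like the one described, the merged segmentation would be converted to such labels before CRF training, and the trained model's predicted labels would be decoded back into words for the training, development, and test sets.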