Workshop on Building and Using Comparable Corpora

Chinese-Japanese Parallel Sentence Extraction from Quasi-Comparable Corpora


Abstract

Parallel sentences are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese-Japanese. Many studies have been conducted on extracting parallel sentences from noisy parallel or comparable corpora. We extract Chinese-Japanese parallel sentences from quasi-comparable corpora, which are available in far larger quantities. This task is significantly more difficult than extraction from noisy parallel or comparable corpora. We extend a previous study that treats parallel sentence identification as a binary classification problem. The previous method of training the classifier using the Cartesian product is not practical, because it differs from the real process of parallel sentence extraction. We propose a novel classifier training method that simulates the real sentence extraction process. Furthermore, we use linguistic knowledge of Chinese character features. Experimental results on quasi-comparable corpora indicate that our proposed approach performs significantly better than the previous study.
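The contrast between the two ways of building classifier training data can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how candidates are selected or how the Chinese character features are computed, so everything below is an assumption made for illustration: the seed corpus is taken to be two index-aligned sentence lists, the candidate filter is a plain character-set overlap, and the names `char_overlap`, `cartesian_pairs`, `simulated_extraction_pairs`, and `top_k` are hypothetical.

```python
from itertools import product


def char_overlap(zh: str, ja: str) -> float:
    """Jaccard overlap of the two sentences' character sets.

    Hypothetical stand-in for a Chinese character feature; a real system
    would also need to map between Chinese and Japanese character variants.
    """
    a, b = set(zh), set(ja)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cartesian_pairs(zh_sents, ja_sents):
    """Cartesian-product training data: every (i, j) combination is an
    example, and only index-aligned pairs are labeled positive (1)."""
    return [
        (zh, ja, int(i == j))
        for (i, zh), (j, ja) in product(enumerate(zh_sents), enumerate(ja_sents))
    ]


def simulated_extraction_pairs(zh_sents, ja_sents, top_k=2):
    """Training data that mimics extraction (assumed procedure): for each
    Chinese sentence, keep only the top_k Japanese candidates that a cheap
    filter ranks highest, then label those candidates."""
    examples = []
    for i, zh in enumerate(zh_sents):
        ranked = sorted(
            range(len(ja_sents)),
            key=lambda j: char_overlap(zh, ja_sents[j]),
            reverse=True,
        )
        for j in ranked[:top_k]:
            examples.append((zh, ja_sents[j], int(i == j)))
    return examples
```

With n aligned sentence pairs, the first function yields n × n examples dominated by trivially unrelated negatives, while the second yields at most n × `top_k` examples whose negatives resemble the hard candidates a classifier would face during actual extraction.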
