Journal: ACM Transactions on Asian Language Information Processing

Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese-Japanese Wikipedia

Abstract

Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract either parallel sentences or fragments from them for SMT. In this article, we propose an integrated system to extract both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences from comparable sentences. We then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a parallel sentence candidate filter and classifier for parallel sentence identification. We improve it by proposing a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to accurately extract parallel fragments from comparable sentences. We propose an accurate parallel fragment extraction method that uses an alignment model to locate the parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. A case study on the Chinese-Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods, and the parallel data extracted by our system significantly improves SMT performance.
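
The abstract mentions a lexicon-based filter that decides whether a candidate fragment pair is truly parallel, but gives no details. The sketch below is only one plausible reading of that idea, not the authors' method: it assumes pre-tokenized fragments, a toy bilingual lexicon stored as a dictionary, and a hypothetical coverage threshold of 0.5. The names `coverage`, `invert`, and `is_parallel_fragment` are illustrative and do not come from the article.

```python
from typing import Dict, List, Set

# Hypothetical lexicon type: source word -> set of possible target translations.
Lexicon = Dict[str, Set[str]]


def coverage(src: List[str], tgt: List[str], lexicon: Lexicon) -> float:
    """Fraction of src tokens that have at least one lexicon translation in tgt."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    hits = sum(1 for word in src if lexicon.get(word, set()) & tgt_set)
    return hits / len(src)


def invert(lexicon: Lexicon) -> Lexicon:
    """Build the reverse lexicon (target word -> set of source words)."""
    inverse: Lexicon = {}
    for src_word, translations in lexicon.items():
        for tgt_word in translations:
            inverse.setdefault(tgt_word, set()).add(src_word)
    return inverse


def is_parallel_fragment(
    zh: List[str],
    ja: List[str],
    zh_ja: Lexicon,
    threshold: float = 0.5,  # assumed value, not taken from the article
) -> bool:
    """Keep a candidate pair only if lexicon coverage is high in both directions."""
    ja_zh = invert(zh_ja)
    return (
        coverage(zh, ja, zh_ja) >= threshold
        and coverage(ja, zh, ja_zh) >= threshold
    )
```

With a toy lexicon such as `{"猫": {"猫", "ネコ"}, "喜欢": {"好き"}, "鱼": {"魚"}}`, the pair `["猫", "喜欢", "鱼"]` and `["猫", "は", "魚", "が", "好き"]` passes the filter, while a pair sharing only one translatable word does not. Checking coverage in both directions is a design choice made here so that a short fragment cannot match a long, mostly unrelated one.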
