Workshop on Building and Using Comparable Corpora

A Generative Model for Extracting Parallel Fragments from Comparable Documents

Abstract

Although parallel corpora are essential language resources for many NLP tasks, they are rare or even unavailable for many language pairs. Comparable corpora, by contrast, are widely available and contain parallel fragments of information that can be used in applications such as statistical machine translation (SMT). In this research, we propose a generative LDA-based model for extracting parallel fragments from comparable documents without using any initial parallel data or bilingual lexicon. Experimental results show a significant improvement when the sentence fragments extracted by the proposed method are used in addition to an existing parallel corpus in an SMT task. According to human judgment, the accuracy of the proposed method on an English-Persian task is about 66%. In addition, the out-of-vocabulary (OOV) rate for the same task is reduced by 28%.
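To make the OOV figure concrete, the following is a minimal sketch of how an out-of-vocabulary rate and its reduction might be measured. It rests on assumptions not stated in the abstract: the rate is computed over test tokens (not word types), the reduction is treated as relative, and the function names and toy data are hypothetical, invented purely for illustration.

```python
def oov_rate(test_tokens, train_vocab):
    """Fraction of test tokens absent from the training vocabulary.

    Token-level definition; this is an assumption, since the abstract
    does not say how the OOV rate was computed.
    """
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in train_vocab)
    return unseen / len(test_tokens)


def relative_reduction(before, after):
    """Relative drop from `before` to `after` (0.0 when `before` is 0)."""
    if before == 0:
        return 0.0
    return (before - after) / before


# Toy data, invented for illustration only (not from the paper).
baseline_vocab = {"the", "cat", "sat"}   # vocabulary of an existing parallel corpus
extracted_vocab = {"on", "mat"}          # vocabulary added by extracted fragments
test_tokens = ["the", "cat", "sat", "on", "a", "mat"]

before = oov_rate(test_tokens, baseline_vocab)
after = oov_rate(test_tokens, baseline_vocab | extracted_vocab)
reduction = relative_reduction(before, after)
```

In this toy setting, adding the extracted vocabulary lowers the number of unseen test tokens from three to one. The numbers bear no relation to the reported 28%.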
