Conference: 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web (2011)

Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia

Abstract

While several recent works on large bilingual collections of texts, e.g. (Smith et al., 2010), seek to extract parallel sentences from comparable corpora, we present PARADOCS, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of controlled tasks. We applied it to the French-English cross-language linked article pairs of Wikipedia in order to see whether parallel articles are available in this resource, and whether our system is able to locate them. According to a manual evaluation we conducted, a fourth of the article pairs in Wikipedia are indeed in a translation relation, and PARADOCS identifies parallel or noisy parallel article pairs with a precision of 80%.
