Conference: 9th International Conference on Language Resources and Evaluation

Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus

Abstract

The European Union is a rich source of high-quality documents with translations into several languages. Parallel corpora built from its publications are frequently used in various tasks, machine translation in particular. A source that has not yet been systematically explored is the EU Bookshop, an online service and archive of publications from various European institutions. The service contains a large body of publications in the 24 official languages of the EU. This paper describes our efforts in collecting those publications and converting them to a format that is useful for natural language processing, in particular statistical machine translation (SMT). We report our procedure for crawling the website and the various pre-processing steps that were necessary to clean up the data after conversion from the original PDF files. Furthermore, we demonstrate the use of this dataset in training SMT models for English, French, German, Spanish, and Latvian.
