Published in: International Conference on Computational Processing of Portuguese
A Parallel Corpus of Theses and Dissertations Abstracts

Abstract

In Brazil, CAPES, the governmental body responsible for overseeing and coordinating post-graduate programs, keeps records of all theses and dissertations presented in the country. Information about these documents can be accessed online in the Theses and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English along with additional metadata. This database is therefore a potential source of Portuguese-English parallel corpora. In this article, we present the development of a parallel corpus from the TDC, which CAPES makes available under its open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign tool. We demonstrate the usefulness of the corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions and comparing them with Google Translate (GT). Both translation models achieved better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, with an average of 82.30% of sentences correctly aligned. Our parallel corpus is freely available in TMX format, with complementary information about document metadata.
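Since the corpus is distributed in TMX format, a reader may want to extract sentence pairs from it. The following is a minimal sketch only, using Python's standard library and the generic TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child). The function name `read_tmx_pairs`, the language codes `pt` and `en`, and the inline sample are assumptions made for illustration; the released files may use different language tags or carry extra metadata elements.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute belongs to the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src="pt", tgt="en"):
    """Return (source, target) segment pairs found in a TMX string.

    Assumes each <tu> holds one <tuv> per language; language tags are
    reduced to their first two letters (e.g. "pt-BR" becomes "pt").
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


# Hypothetical two-language sample, not taken from the actual corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="pt" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="pt"><seg>Este é um exemplo.</seg></tuv>
      <tuv xml:lang="en"><seg>This is an example.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for pt_sentence, en_sentence in read_tmx_pairs(SAMPLE_TMX):
        print(pt_sentence, "|||", en_sentence)
```

For a file of the size described here, a streaming parser such as `ET.iterparse` would likely be preferable to loading the whole document into memory.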
