A Parallel Corpus of Theses and Dissertations Abstracts

Abstract

In Brazil, CAPES, the governmental body responsible for overseeing and coordinating post-graduate programs, keeps records of all theses and dissertations presented in the country. Information about these documents can be accessed online in the Theses and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English along with additional metadata. This database is therefore a potential source of parallel corpora for the Portuguese and English languages. In this article, we present the development of a parallel corpus from TDC, which is made available by CAPES under the open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign tool. We demonstrate the usefulness of the corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions and comparing them with Google Translate (GT). Both models achieved better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, with an average of 82.30% of sentences correctly aligned. Our parallel corpus is freely available in TMX format, with complementary information regarding document metadata.
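Since the corpus is distributed in TMX format, a short sketch of how aligned sentence pairs might be read from such a file may be useful. The abstract does not describe the exact layout of the released files, so the code below assumes a standard TMX 1.4 structure (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child) and the language codes `pt` and `en`. The function name `read_tmx_pairs` and the sample sentences are hypothetical, chosen only for illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="pt", tgt_lang="en"):
    """Yield (source, target) segment pairs from a TMX file path or file object.

    Assumes each <tu> holds one <tuv> per language; units missing either
    language are skipped.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "pt-BR" is treated as "pt".
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release memory for large corpora


# Tiny hand-written sample standing in for a real corpus file (illustrative only).
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="pt" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="pt"><seg>Este trabalho apresenta um corpus paralelo.</seg></tuv>
      <tuv xml:lang="en"><seg>This work presents a parallel corpus.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE_TMX)))
for pt_sentence, en_sentence in pairs:
    print(pt_sentence, "|||", en_sentence)
```

Streaming with `iterparse` rather than loading the whole tree is a deliberate choice here: a corpus built from roughly 240,000 documents could be large, and clearing each unit after use keeps memory consumption low.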