Building the Spanish-Croatian Parallel Corpus

Abstract

This paper describes the building of the first Spanish-Croatian unidirectional parallel corpus, constructed at the Faculty of Humanities and Social Sciences of the University of Zagreb. The corpus comprises eleven Spanish novels and their translations into Croatian by six different professional translators. All the texts were published between 1999 and 2012. The corpus contains more than 2 Mw, with approximately 1 Mw for each language. It was automatically sentence-segmented and aligned, then manually post-corrected, and contains 71,778 translation units. To protect copyright and to make the corpus available under a permissive CC-BY licence, the aligned translation units are shuffled. This restricts the corpus to research on language units at the sentence level and below. Two versions of the corpus in TMX format will be available for download through the META-SHARE and CLARIN ERIC infrastructures. The former contains plain TMX, while the latter is lemmatised and POS-tagged and stored in the aTMX format.
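The shuffling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain TMX file with one `tu` element per aligned sentence pair and `es`/`hr` language codes. The sample sentences, the helper names `read_units` and `shuffle_units`, and the fixed seed are all hypothetical and are not taken from the corpus or from the authors' tooling.

```python
import random
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical four-unit sample; the sentences are invented for illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="es" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu><tuv xml:lang="es"><seg>Buenos días.</seg></tuv>
        <tuv xml:lang="hr"><seg>Dobar dan.</seg></tuv></tu>
    <tu><tuv xml:lang="es"><seg>Gracias.</seg></tuv>
        <tuv xml:lang="hr"><seg>Hvala.</seg></tuv></tu>
    <tu><tuv xml:lang="es"><seg>Buenas noches.</seg></tuv>
        <tuv xml:lang="hr"><seg>Laku noć.</seg></tuv></tu>
    <tu><tuv xml:lang="es"><seg>Sí.</seg></tuv>
        <tuv xml:lang="hr"><seg>Da.</seg></tuv></tu>
  </body>
</tmx>"""


def read_units(root):
    """Return one {language: sentence} dict per translation unit, in file order."""
    units = []
    for tu in root.iter("tu"):
        unit = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        units.append(unit)
    return units


def shuffle_units(root, seed=0):
    """Reorder the <tu> elements in place; each pair stays aligned internally."""
    body = root.find("body")
    tus = list(body)
    random.Random(seed).shuffle(tus)
    for tu in tus:
        body.remove(tu)
    body.extend(tus)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_TMX)
    shuffle_units(root, seed=1)
    for unit in read_units(root):
        print(unit["es"], "|", unit["hr"])
```

Because only the order of whole translation units changes, every Spanish sentence remains paired with its Croatian translation, while the running text of each novel can no longer be read in sequence. This is why such a corpus supports research at the sentence level and below but not discourse-level research.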