
A Diachronic Corpus for Romanian (RoDia)


Abstract

This paper describes a Romanian dependency treebank built at the Al. I. Cuza University (UAIC) and the special OCR techniques used to build it. The corpus has rich morphological and syntactic annotation. There are few annotated representative corpora for Romanian, and the existing ones focus mainly on the contemporary standard language. The corpus described here focuses on non-standard aspects of the language: Regional and Old Romanian. Intending to participate in the PROIEL project, which aligns the oldest New Testaments, we annotate the first printed Romanian New Testament (Alba Iulia, 1648). We began by applying the UAIC tools for the morphological and syntactic processing of Contemporary Romanian to the book's first quarter (second edition). By carefully correcting by hand the output of the automatic annotation, which had modest accuracy, we obtained a sub-corpus for training tools to process Old Romanian. The first edition of the New Testament, however, is written in Cyrillic letters. Books printed in the Old Cyrillic alphabet pose a common problem for Romania and the Republic of Moldova, the countries where Romanian is spoken; it is a problem to be solved through the joint efforts of NLP researchers in the two countries.
