Conference: Annual Meeting of the Association for Computational Linguistics; International Conference on Computational Linguistics; ICCL

Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages

Abstract

The EU Copernicus project Multext-East has created a multilingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally usable for encoding linguistic corpora. We also describe the methodology for the development of a harmonized set of morphosyntactic descriptions (MSDs), which builds upon the scheme for western European languages developed within the EAGLES project. We discuss the special concerns for handling the six project languages, which cover three distinct language families.
