Journal: Language Resources and Evaluation

MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Abstract

The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel "1984" by George Orwell, which is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset, unique in terms of languages covered and the wealth of encoding, is extensively documented, and freely available for research purposes. The paper overviews the MULTEXT-East resources by type and language and gives some conclusions and directions for further work.