Journal: Journal of Language Modelling

The Bulgarian National Corpus: Theory and Practice in Corpus Design


Abstract

The paper discusses several key concepts related to the development of corpora and reconsiders them in light of recent developments in NLP. On the basis of an overview of present-day corpora, we conclude that the dominant practices of corpus design do not adequately utilise the available technologies and, as a result, fail to meet the demands of corpus linguistics, computational lexicology and computational linguistics alike.

We proceed to lay out a data-driven approach to corpus design, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies, allowing fast collection, automatic metadata description and annotation of large amounts of data. Thus, the gist of the approach we propose is that corpus design should be centred on amassing large amounts of mono- and multilingual texts and on providing them with a detailed metadata description and high-quality multi-level annotation.

We go on to illustrate this concept with a description of the compilation, structuring, documentation, and annotation of the Bulgarian National Corpus (BulNC). At present it consists of a Bulgarian part of 979.6 million words, constituting the corpus kernel, and 33 Bulgarian-X language corpora, totalling 972.3 million words, for 1.95 billion words altogether. The BulNC is supplied with a comprehensive metadata description, which allows us to organise the texts according to different principles. The Bulgarian part of the BulNC is automatically processed (tokenised and sentence-split) and annotated at several levels: morphosyntactic tagging, lemmatisation, word-sense annotation, and annotation of noun phrases and named entities. Some levels of annotation are also applied to the Bulgarian-English parallel corpus, with the prospect of expanding multilingual annotation both in terms of linguistic levels and the number of languages for which it is available.

We conclude with a brief evaluation of the quality of the corpus and an outline of its applications in NLP and linguistic research.
