Published in: International journal of data mining, modelling and management

Annotation tools for syntax and named entities in the national corpus of Polish


Abstract

The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.