Conference: International Conference on Text, Speech and Dialogue

Corpus Annotation Pipeline for Non-standard Texts

Abstract

According to some estimates (e.g. [9]), web corpora contain over 6% foreign material (borrowings, language mixing, named entities). Since annotation pipelines are usually built on standard, correct data, the resulting annotation of web corpora often contains serious errors. We studied the annotation errors of the web corpus czTenTen 12 in depth and propose an extension to desamb, the tagger that had been used to annotate czTenTen. First, we built a subcorpus from the most problematic documents in czTenTen. Second, we established measures for the most frequent annotation errors. Third, we ran several experiments in which we extended the annotation pipeline so that it could annotate foreign material and multi-word expressions. Finally, we compared the new annotations of the subcorpus with the original ones.