Venue: International Workshop on Treebanks and Linguistic Theories

REALEC learner treebank: annotation principles and evaluation of automatic parsing


Abstract

The paper presents a Universal Dependencies (UD) annotation scheme for a corpus of learner English. The REALEC dataset consists of essays written in English by Russian-speaking university students in a general English course. The original corpus is manually annotated for learners' errors: for each error, experts record its span, its type, and a possible correction. Syntactic dependency annotation adds value to a learner corpus because it makes it possible to explore how syntax interacts with different types of errors, and it helps to assess the syntactic complexity of learners' texts. When adapting existing dependency parsing tools, one has to take into account the extent to which students' mistakes provoke errors in the parser output: ungrammatical and stylistically inappropriate utterances may challenge parsers trained on grammatically well-formed academic texts. In our experiments, we compared the output of the dependency parser UDPipe (trained on UD English 2.0) with manual parses, focusing in particular on parses of ungrammatical English clauses. We show how mistakes made by students affect the parser's output. Overall, UDPipe performed reasonably well (UAS 92.9, LAS 91.7). We analyse several cases of erroneous parsing that are due either to incorrect detection of a head or to the wrong choice of relation type. We propose some solutions that could improve the automatic output and thus make syntax-based learner corpus research and the assessment of syntactic complexity more reliable. The REALEC treebank is freely available under the CC BY-SA 3.0 licence.
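The two reported metrics correspond to the two error sources the abstract names: UAS (unlabeled attachment score) counts tokens whose head is correct, while LAS (labeled attachment score) additionally requires the correct relation type. The following is a minimal sketch of how such scores might be computed from a manual parse and a parser's output. It is not the authors' evaluation code: the function name `attachment_scores`, the token representation, and the four-token example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

# Each token is represented as (head index, relation label); head 0 is the root.
# This representation is an assumption for illustration, not the REALEC format.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two parses of the same tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty parse")
    # UAS: the predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the relation label match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Hypothetical parses of "She reads a book" (tokens numbered 1 to 4).
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
# Token 3 has the wrong head; token 4 has the right head but the wrong label.
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "iobj")]

uas, las = attachment_scores(gold, pred)
print(f"UAS {uas:.1f}, LAS {las:.1f}")  # prints "UAS 75.0, LAS 50.0"
```

In this toy example the wrong head lowers both scores, whereas the wrong label lowers only LAS, which is why LAS can never exceed UAS.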
