Source: International Journal on Document Analysis and Recognition

Treebanks gone bad: Parser evaluation and retraining using a treebank of ungrammatical sentences

Abstract

This article describes how a treebank of ungrammatical sentences can be created from a treebank of well-formed sentences. The creation procedure automatically introduces frequently occurring grammatical errors into the sentences of an existing treebank and minimally transforms the original analyses so that they describe the newly created ill-formed sentences. Such a treebank can be used to test how well a parser is able to ignore grammatical errors in text (as people do), and to induce a grammar capable of analysing such sentences. This article demonstrates both applications using the Penn Treebank. In a robustness evaluation experiment, two state-of-the-art statistical parsers are evaluated on an ungrammatical version of Section 23 of the Wall Street Journal (WSJ) portion of the Penn Treebank. This experiment shows that the performance of both parsers degrades in the presence of grammatical noise. A breakdown by error type is provided for both parsers. A second experiment retrains both parsers using an ungrammatical version of WSJ Sections 2-21. This experiment indicates that an ungrammatical treebank is a useful resource for improving parser robustness to grammatical errors, but that the correct combination of grammatical and ungrammatical training data has yet to be determined.
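
The creation procedure described in the abstract (introduce an error into a sentence, then minimally adjust its tree) can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the tuple-based tree encoding, the function names `is_leaf`, `leaves`, and `delete_word`, the choice of a missing-word error, and the rule that a constituent left empty by the deletion is pruned. The abstract does not state which error types or transformation rules the article actually uses.

```python
from typing import List, Optional, Tuple, Union

# Assumed encoding: a leaf is (POS tag, word); an internal node is
# (phrase label, list of child trees).
Leaf = Tuple[str, str]
Node = Tuple[str, list]
Tree = Union[Leaf, Node]


def is_leaf(tree: Tree) -> bool:
    """A leaf stores a word (a string) in its second slot."""
    return isinstance(tree[1], str)


def leaves(tree: Tree) -> List[str]:
    """Return the words of the tree in left-to-right order."""
    if is_leaf(tree):
        return [tree[1]]
    words: List[str] = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words


def delete_word(tree: Tree, index: int) -> Optional[Tree]:
    """Simulate a missing-word error (hypothetical error type).

    Returns a new tree without the word at position `index`. Constituents
    that end up with no words are pruned; all other structure is kept.
    Returns None if the deleted word was the only word in the tree.
    """
    if not 0 <= index < len(leaves(tree)):
        raise IndexError("word index out of range")

    def helper(node: Tree, start: int):
        # Returns (transformed node or None, number of words under node).
        if is_leaf(node):
            return (None if start == index else node), 1
        kept = []
        count = 0
        for child in node[1]:
            new_child, n = helper(child, start + count)
            count += n
            if new_child is not None:
                kept.append(new_child)
        return ((node[0], kept) if kept else None), count

    new_tree, _ = helper(tree, 0)
    return new_tree


# A hand-built example tree (not taken from the Penn Treebank).
example: Tree = (
    "S",
    [
        ("NP", [("DT", "The"), ("NN", "parser")]),
        ("VP", [("VBZ", "ignores"), ("NP", [("NNS", "errors")])]),
    ],
)

noisy = delete_word(example, 0)  # drop the determiner "The"
print(" ".join(leaves(noisy)))
print(noisy)
```

Deleting the determiner leaves the subject `NP` in place with a single `NN` child, so the analysis of the rest of the sentence is unchanged. Applying such a function over a whole treebank would yield paired ill-formed sentences and trees of the kind the abstract describes, suitable both as a robustness test set and as retraining data.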
