Detecting grammatical errors with treebank-induced, probabilistic parsers

Abstract

Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. 
In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
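
The decision rule of the second approach can be illustrated with a small sketch. The abstract does not say which features the estimator uses, whether probabilities are compared in log space, or how the margin is chosen, so everything concrete here is an assumption made for illustration: the estimator is a least-squares line over sentence length, probabilities are log-probabilities, and the training values and the names `fit_line`, `is_flagged` and `margin` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def is_flagged(length, actual_logprob, model, margin):
    """Flag a sentence when the estimated log-probability of its best
    parse exceeds the actual parse log-probability by more than margin."""
    a, b = model
    estimated = a * length + b
    return estimated - actual_logprob > margin


# Invented toy data standing in for grammatical training sentences that
# have been parsed and annotated with the log-probability of the best parse.
train_lengths = [5, 10, 15, 20]
train_logprobs = [-25.0, -50.0, -75.0, -100.0]
model = fit_line(train_lengths, train_logprobs)

# A 10-word sentence whose parse is about as likely as expected is accepted;
# one whose parse is far less likely than expected is flagged.
print(is_flagged(10, -52.0, model, margin=10.0))
print(is_flagged(10, -70.0, model, margin=10.0))
```

In a real system the estimator would presumably draw on richer features than length alone, and the margin would be tuned on held-out data to trade precision against recall.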

Bibliographic details

  • Author: Wagner, Joachim
  • Year: 2012
  • Original format: PDF
  • Language of text: en
