Treebank vs. xbar-based automatic f-structure annotation

Abstract

Manual, large-scale (computational) grammar development is time-consuming and expensive, and it requires considerable linguistic expertise. More recently, a number of alternatives based on treebank resources (such as Penn-II, Susanne, AP treebank) have been explored. The idea is to automatically "induce", or rather read off, (P)CFG grammars from the parse-annotated treebank resources and to use the treebank grammars thus obtained in (probabilistic) parsing or as a starting point for further grammar development. The approach is cheap, fast, automatic, large-scale, "data-driven" and based on real language resources.

Treebank grammars typically involve large sets of lexical tags and non-lexical categories, as syntactic information tends to be encoded in monadic category symbols. They feature flat rules (trees) that can "underspecify" attachment possibilities. Treebank grammars do not in general follow Xbar architectural design principles (this is not to say that treebank grammars do not have design principles). As a consequence, treebank grammars tend to have very large CFG rule bases (e.g. Penn-II > 17,000 CFG rules for about 1 million words of text), often with only minimally differing rules. Even though treebank grammars are large, they are still incomplete, exhibiting unabated rule accession rates. From a grammar engineering point of view, the size of the rule base poses problems for maintainability, extendability and, if a treebank grammar is to be used as a CF-base in an LFG grammar, for functional (feature-structure) annotations. From the point of view of theoretical linguistics, flat treebank trees and treebank grammars extracted from such trees do not express linguistic generalisations. From the perspective of empirical and corpus linguistics, flat trees are well-motivated, as they allow underspecification of subtle and often time-consuming attachment decisions. Indeed, it is sometimes doubted whether highly general Xbar schemata usefully scale to "real" language.

In previous work we developed methodologies for automatic feature-structure annotation of grammars extracted from treebanks. Automatic annotation of "raw" treebank grammars is difficult: annotation rules often need to identify subsequences in the RHSs of flat treebank rules, because the annotations explicitly encode head, complement and modifier relations. Xbar-based CFG rules should substantially facilitate automatic feature-structure annotation of grammar rules.

In the present paper we conduct a number of experiments to explore a space of possible grammars based on a small fragment of the AP treebank resource. Starting with the original treebank fragment, we automatically extract a CFG G. We then apply an automatic, structure-preserving grammar compaction step which generalises categories in the original treebank fragment and reduces the number of rules extracted, resulting in a generalised treebank fragment and in a compacted grammar Gc. The generalised fragment is then manually corrected to catch missed constituents (and the like), resulting in an automatically extracted, compacted and (effectively manually) corrected grammar Gc,m. Manual correction proceeds in the "spirit" of treebank grammars (we do not introduce Xbar analyses). We then explore how many of the manual correction steps on treebank trees can be achieved automatically. We develop, implement and test an automatic treebank "grooming" methodology, which is applied to the generalised treebank fragment to yield a compacted and automatically corrected grammar Gc,a. Grammars Gc,m and Gc,a are very similar to compiled-out "flat" LFG-82 style grammars. We explore regular-expression-based compaction (both manual and automatic) to relate Gc,m to an LFG-82 style grammar design. Finally, we manually recode a subsection of the generalised and manually corrected treebank fragment into "vanilla-flavour" Xbar-based trees. From these we extract a compacted, manually corrected, Xbar-based grammar Gc,m,x. We evaluate our grammars and methods using standard labelled bracketing measures and according to how well they perform under automatic feature-structure annotation tasks.
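
The first two steps described above, reading a CFG G off the treebank trees and compacting it into Gc by generalising categories, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The two toy trees, the Penn-style labels and the `generalise` mapping (drop functional suffixes such as `-SBJ`, merge a few tag variants) are all assumptions made for illustration; the abstract does not specify the actual generalisation scheme or the AP treebank tag set.

```python
from collections import Counter

def parse_tree(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()

def rules(tree, generalise=lambda c: c):
    """Yield one (LHS, RHS) pair per non-lexical local tree."""
    label, children = tree
    subtrees = [c for c in children if not isinstance(c, str)]
    if not subtrees:
        return                        # tag over a word: no CFG rule read off
    yield (generalise(label), tuple(generalise(c[0]) for c in subtrees))
    for c in subtrees:
        yield from rules(c, generalise)

def extract_grammar(treebank, generalise=lambda c: c):
    """Count the rules read off every tree in the treebank."""
    return Counter(r for t in treebank for r in rules(parse_tree(t), generalise))

# Hypothetical generalisation: strip functional suffixes, merge tag variants.
TAG_MAP = {"NNS": "NN", "VBD": "VB", "VBZ": "VB"}

def generalise(category):
    base = category.split("-")[0]
    return TAG_MAP.get(base, base)

# Toy treebank fragment (invented for this sketch).
treebank = [
    "(S (NP-SBJ (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NNS dogs)) (VP (VBZ bark)))",
]

G = extract_grammar(treebank)                # raw treebank grammar
Gc = extract_grammar(treebank, generalise)   # compacted grammar

print(len(G), "rules before compaction,", len(Gc), "after")
```

Because `generalise` only relabels categories, the tree shapes are untouched, which is one way to read "structure-preserving" here. In this toy fragment the six minimally differing rules of G collapse into three rules in Gc.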
