Pacific Asia Conference on Language, Information and Computation

TCtract-A Collocation Extraction Approach for Noun Phrases Using Shallow Parsing Rules and Statistic Models




This paper presents a hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge. The algorithm first extracts all noun phrase collocations from a shallow-parsed corpus, using syntactic knowledge in the form of phrase rules. It then removes pseudo-collocations by applying a set of statistics-based association measures (AMs) as filters. The hybrid design has two main purposes: (1) to maintain reasonable recall while improving precision, and (2) to investigate the proposed association measures on Chinese noun phrase collocations. Performance is compared with a purely statistical model and a purely rule-based method on a 60 MB PoS-tagged corpus. Experimental results on 29 randomly selected noun headwords show that the proposed hybrid method achieves a precision of 92.65% and a recall of 47%, compared with a precision of 78.87% and a recall of 27.19% for a statistics-based extraction system. The F-score improvement is 55.7%.
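The two-stage design described in the abstract (rule-based candidate extraction followed by statistical filtering) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the chunk format, the PoS tags, the single "modifier + head noun" rule, the choice of pointwise mutual information (PMI) as the association measure, and the threshold. None of these are the paper's actual rules or measures, which the abstract does not spell out.

```python
import math
from collections import Counter

# Hypothetical input format: a sentence is a list of chunks, and each chunk is
# (chunk_label, [(word, pos), ...]). Tags "a" (adjective) and "n" (noun) are
# placeholders, not the tag set used in the paper.

def extract_candidates(sentences):
    """Stage 1 (rules): pair each modifier in an NP chunk with its head noun."""
    pairs = []
    for chunks in sentences:
        for label, tokens in chunks:
            if label != "NP" or len(tokens) < 2:
                continue
            head, head_pos = tokens[-1]  # assume the last token is the head
            if head_pos != "n":
                continue
            for word, pos in tokens[:-1]:
                if pos in ("a", "n"):
                    pairs.append((word, head))
    return pairs

def filter_by_pmi(pairs, threshold=0.0):
    """Stage 2 (statistics): keep pairs whose PMI exceeds the threshold."""
    pair_counts = Counter(pairs)
    mod_counts = Counter(m for m, _ in pairs)
    head_counts = Counter(h for _, h in pairs)
    total = len(pairs)
    kept = {}
    for (m, h), c in pair_counts.items():
        # PMI = log2( P(m, h) / (P(m) * P(h)) ), estimated over candidate pairs
        pmi = math.log2(c * total / (mod_counts[m] * head_counts[h]))
        if pmi > threshold:
            kept[(m, h)] = pmi
    return kept

# Toy corpus in the hypothetical format above.
corpus = (
    [[("NP", [("economic", "a"), ("development", "n")])]] * 3
    + [[("NP", [("new", "a"), ("development", "n")])]]
    + [[("NP", [("new", "a"), ("problem", "n")]), ("VP", [("appear", "v")])]] * 3
)
candidates = extract_candidates(corpus)
collocations = filter_by_pmi(candidates)
print(sorted(collocations))
```

In this toy run the rule stage proposes every modifier-head pair, and the filter stage drops the pair that co-occurs less often than chance would predict. That mirrors the stated goal of keeping the recall of the rule-based stage while raising precision.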




