The study on automatic Chinese collocation extraction.

Abstract

Collocation is a lexical phenomenon in which two or more words are habitually combined as a conventional way of saying things. Collocation information is essential to many natural language processing tasks, such as word sense disambiguation, machine translation, and information extraction. Most current work on collocation extraction is statistically based and has limited precision and recall, because it cannot reliably distinguish word co-occurrences, which are merely statistically significant, from true collocations, which are in habitual use and are thus either syntactically or semantically significant.

The objective of this study is to investigate methods for improving the performance of collocation extraction algorithms. Different types of collocations are identified, and extraction algorithms are then designed to target each type using the features and criteria associated with it. In addition to improving statistically based collocation extraction algorithms, syntactic and semantic information is incorporated into the algorithms to further improve extraction performance.

In the study of statistically based algorithms, a new algorithm based on bi-directional word bi-grams is designed to help identify collocations that have low co-occurrence frequency and are of fixed use. A large-scale collocation answer set is established so that collocation extraction algorithms can be evaluated and compared objectively using the same training corpus and the corresponding answer set. Collocations are then categorized into four types based on their compositionality, substitutability, and modifiability. Based on the characteristics of each type, a multi-stage window-based collocation extraction system is built, in which n-gram collocations and the different types of bi-gram collocations are extracted separately in different stages using different strategies and different discriminative features.

A shallow treebank, referred to as the PolyU Treebank, is annotated manually to provide syntactic and semantic knowledge that further helps collocation extraction. This treebank is also used to train a chunker based on a lexicalized Hidden Markov Model (HMM). The chunker provides a way to process running text for collocation extraction.

Using the support collocation patterns and reject collocation patterns extracted from the annotated treebank and from parsed running text, syntactic features are employed to further improve the performance of the window-based collocation extraction system. Experimental results show that the use of syntactic patterns can significantly improve the performance of collocation extraction, especially in filtering out pseudo collocations.

The extracted collocations were applied in the post-processing of a handwritten Chinese character recognition system. Experiments indicate that collocation information can be used in real applications to improve the performance of such natural language related applications. It should be pointed out that this work focuses on collocation extraction from Chinese text. However, the techniques developed are applicable to other languages, although separate annotation and an understanding of the different syntactic and semantic knowledge involved are needed.

Keywords: Natural language processing, collocation extraction, treebank, chunking and parsing.
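
As a rough illustration of the statistically based, window-based extraction that the abstract takes as its starting point, the sketch below scores ordered word pairs that co-occur within a small window using pointwise mutual information (PMI). It is a generic baseline and not the thesis's bi-directional bi-gram algorithm or its multi-stage system. The function name `window_pmi`, the parameters `window` and `min_count`, and the toy corpus are all assumptions made for this example, and the input is assumed to be already word-segmented.

```python
import math
from collections import Counter


def window_pmi(sentences, window=5, min_count=2):
    """Rank ordered word pairs that co-occur within `window` tokens by PMI.

    Hypothetical helper for illustration only; `sentences` is a list of
    token lists (word segmentation is assumed to be done beforehand).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, head in enumerate(tokens):
            # Pair the head word with each word in the window to its right.
            for other in tokens[i + 1 : i + 1 + window]:
                pair_counts[(head, other)] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), n in pair_counts.items():
        if n < min_count:
            continue  # skip pairs seen too rarely to score reliably
        p_pair = n / total_pairs
        p_w1 = word_counts[w1] / total_words
        p_w2 = word_counts[w2] / total_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up corpus (not data from the thesis).
toy_corpus = [
    ["strong", "tea", "is", "good"],
    ["strong", "tea", "is", "hot"],
    ["the", "tea", "is", "cold"],
    ["it", "is", "fine"],
]
ranked = window_pmi(toy_corpus, window=1, min_count=2)
for pair, score in ranked:
    print(pair, round(score, 3))
```

The `min_count` threshold in this sketch discards low-frequency pairs outright, which is the kind of limitation the abstract describes: fixed-use collocations with low co-occurrence frequency are missed by purely frequency-driven scoring, and frequent but non-collocational pairs can still score well.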

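Bibliographic record

  • Author

    Xu, Ruifeng

  • Author affiliation

    Hong Kong Polytechnic University (People's Republic of China)

  • Degree-granting institution Hong Kong Polytechnic University (People's Republic of China)
  • Discipline Computer Science
  • Degree Ph.D.
  • Year 2006
  • Pages 196 p.
  • Total pages 196
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification Automation technology, computer technology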

