
Discovering Compound and Proper Nouns




Identifying appropriate text tokens (words or sequences of words that represent concepts) is one of the most important tasks in text preprocessing and can strongly influence the final results of text analysis. In this paper, we introduce a new approach to discovering compound nouns, including proper compound nouns. The approach combines data mining methods with shallow lexical analysis. We propose a simple pattern language for specifying the grammatical patterns that extracted compound nouns must satisfy. The method requires the words to be annotated with part-of-speech tags, and to this extent it is language-dependent. Building on the GSP data mining algorithm, we propose a modification, T-GSP, for extracting frequent text patterns, in particular frequent word sequences that satisfy given grammatical rules. The resulting sequences are treated as candidates for compound nouns. In our experiments, the method produced results of very high quality.
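The core idea of the abstract, keeping only frequent word sequences whose part-of-speech tags satisfy a grammatical pattern, can be illustrated with a minimal sketch. This is not T-GSP itself: it enumerates contiguous word windows by brute force instead of using GSP-style level-wise candidate generation, and the single-letter tag set, the regular expression standing in for the pattern language, the toy corpus, and the function name `frequent_candidates` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tagged corpus: each sentence is a list of (word, tag) pairs.
# Single-letter tags (N = noun, V = verb, A = adjective, ...) are an
# assumption that lets a regular expression act as the grammatical pattern.
SENTENCES = [
    [("data", "N"), ("mining", "N"), ("is", "V"), ("fun", "A")],
    [("we", "P"), ("use", "V"), ("data", "N"), ("mining", "N")],
    [("New", "N"), ("York", "N"), ("is", "V"), ("big", "A")],
    [("data", "N"), ("mining", "N"), ("in", "I"), ("New", "N"), ("York", "N")],
]

# Illustrative pattern: optional adjectives followed by two or more nouns.
PATTERN = re.compile(r"A*NN+")


def frequent_candidates(sentences, pattern, min_support=2, max_len=4):
    """Return word sequences whose tags match `pattern` in at least
    `min_support` sentences (each sentence counts a sequence once)."""
    support = Counter()
    for sent in sentences:
        seen = set()
        for i in range(len(sent)):
            # Contiguous windows of 2..max_len words starting at position i.
            for j in range(i + 2, min(i + max_len, len(sent)) + 1):
                window = sent[i:j]
                tags = "".join(tag for _, tag in window)
                if pattern.fullmatch(tags):
                    seen.add(tuple(word for word, _ in window))
        support.update(seen)
    return {seq: n for seq, n in support.items() if n >= min_support}


if __name__ == "__main__":
    for seq, n in frequent_candidates(SENTENCES, PATTERN).items():
        print(" ".join(seq), n)
```

On this toy corpus, "data mining" and "New York" are the only sequences that both match the pattern and meet the support threshold. A level-wise miner in the spirit of GSP would instead grow candidates from shorter frequent sequences, which avoids enumerating every window on a large corpus.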


