
Discovering Compound and Proper Nouns


Abstract

Identifying appropriate text tokens (words or sequences of words that represent concepts) is one of the most important tasks in text preprocessing and can strongly influence the final results of text analysis. In this paper, we introduce a new approach to discovering compound nouns, including proper compound nouns. Our approach combines data mining methods with shallow lexical analysis. We propose a simple pattern language for specifying the grammatical patterns that extracted compound nouns must satisfy. Our method requires the words to be annotated with part-of-speech tags; to this extent, it is language-dependent. Building on the GSP data mining algorithm, we propose T-GSP, a modification of GSP for extracting frequent text patterns and, in particular, frequent word sequences that satisfy given grammatical rules. The resulting sequences are regarded as candidates for compound nouns. The experiments demonstrate the very high quality of the method.
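
The general idea of mining frequent word sequences and then filtering them by a grammatical pattern can be illustrated with a minimal Python sketch. This is not the T-GSP algorithm or the pattern language from the paper, whose details the abstract does not give. It is a simplified level-wise (GSP-style) miner for contiguous word sequences, and the function names, the tag set (ADJ, NOUN, PROPN) and the regular expression standing in for a grammatical pattern are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical grammatical pattern over space-joined POS tags:
# one or more adjectives/nouns followed by a final noun.
NOUN_PHRASE = re.compile(r"(ADJ |NOUN |PROPN )+(NOUN|PROPN)")


def frequent_sequences(sentences, min_support=2, max_len=4):
    """Level-wise mining of frequent contiguous word sequences.

    sentences: list of sentences, each a list of (word, pos_tag) pairs.
    Support is the number of sentences that contain the sequence.
    """
    frequent = {}
    prev_level = None
    for n in range(1, max_len + 1):
        counts = Counter()
        for sent in sentences:
            seen = set()  # count each sequence once per sentence
            for i in range(len(sent) - n + 1):
                gram = tuple(sent[i:i + n])
                # Apriori-style pruning: both (n-1)-long sub-sequences
                # must already be frequent.
                if prev_level is not None and (
                    gram[:-1] not in prev_level or gram[1:] not in prev_level
                ):
                    continue
                seen.add(gram)
            counts.update(seen)
        level = {g: c for g, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
        prev_level = level
    return frequent


def compound_noun_candidates(sentences, min_support=2, max_len=4):
    """Keep frequent multi-word sequences whose tags match NOUN_PHRASE."""
    result = {}
    for gram, support in frequent_sequences(sentences, min_support, max_len).items():
        tags = " ".join(tag for _, tag in gram)
        if len(gram) >= 2 and NOUN_PHRASE.fullmatch(tags):
            result[" ".join(word for word, _ in gram)] = support
    return result


if __name__ == "__main__":
    corpus = [
        [("data", "NOUN"), ("mining", "NOUN"), ("is", "VERB"), ("useful", "ADJ")],
        [("we", "PRON"), ("use", "VERB"), ("data", "NOUN"), ("mining", "NOUN")],
        [("New", "PROPN"), ("York", "PROPN"), ("is", "VERB"), ("big", "ADJ")],
        [("I", "PRON"), ("love", "VERB"), ("New", "PROPN"), ("York", "PROPN")],
    ]
    print(compound_noun_candidates(corpus))
    # {'data mining': 2, 'New York': 2}
```

In this sketch the grammatical filter is applied only after mining. A method that checks the pattern during candidate generation, as the abstract suggests T-GSP does, could prune the search space earlier.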
