Journal: 《信息网络安全》

A Chinese Keyword Extraction Method Fusing Multiple Features

Abstract

Keyword extraction is the task of distilling from a text the words or phrases that summarize its content, and it is a key technique in text processing. To address the sensitivity of keyword extraction to word-segmentation quality, as well as statistical bias, this paper proposes a Chinese keyword extraction method that fuses multiple features. The method considers the influence of six factors on keyword weight (term frequency, term length, part of speech, position, an internet dictionary, and a stop-word list), proposes a quantification scheme for each factor, and then combines linear weighting with compound-word generation and filtering to extract keywords. The experimental data are papers downloaded from China National Knowledge Infrastructure (中国知网) in 10 categories: environment, information science, transportation, education, economics, literature and history, chemistry, medicine, agriculture, and politics. Every paper includes keywords assigned by its authors. The results show that, with the number of candidate words N set to 5, the approximate-matching precision is 54.8% and the recall is 65.1%. The method not only addresses the low recall caused by word-segmentation errors, but also extracts words that occur infrequently yet are important to the meaning of the text. The extracted keywords convey the meaning of the text markedly better than those produced by statistics-based methods, which gives the method greater practical value.
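The abstract does not give the quantification schemes or the weights, so the following is only a minimal sketch of how a linear weighting over the listed factors might look. Every name, weight, and scoring rule here (`WEIGHTS`, `POS_SCORES`, `Candidate`, `score`, `top_keywords`, the title-based position feature) is a hypothetical stand-in, not the paper's formulation, and the compound-word generation and filtering step is omitted.

```python
from dataclasses import dataclass

# Illustrative assumptions only: the paper's actual weights and
# per-factor quantification schemes are not stated in the abstract.
WEIGHTS = {"tf": 0.4, "length": 0.15, "pos": 0.15, "position": 0.2, "dictionary": 0.1}
POS_SCORES = {"n": 1.0, "v": 0.5, "a": 0.3}  # hypothetical part-of-speech scores


@dataclass
class Candidate:
    word: str
    count: int           # occurrences in the document
    pos: str             # part-of-speech tag from the segmenter
    in_title: bool       # crude position feature: does the word occur in the title?
    in_dictionary: bool  # is the word listed in the internet dictionary?


def score(c, max_count, max_len, stop_words):
    """Linear combination of normalized features; stop words score 0."""
    if c.word in stop_words:
        return 0.0
    features = {
        "tf": c.count / max_count,
        "length": len(c.word) / max_len,
        "pos": POS_SCORES.get(c.pos, 0.0),
        "position": 1.0 if c.in_title else 0.0,
        "dictionary": 1.0 if c.in_dictionary else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


def top_keywords(candidates, stop_words, n=5):
    """Return the n highest-scoring candidate words, dropping stop words."""
    if not candidates:
        return []
    max_count = max(c.count for c in candidates)
    max_len = max(len(c.word) for c in candidates)
    scored = [(c.word, score(c, max_count, max_len, stop_words)) for c in candidates]
    scored = [(w, s) for w, s in scored if s > 0]
    scored.sort(key=lambda ws: ws[1], reverse=True)
    return [w for w, _ in scored[:n]]


candidates = [
    Candidate("关键词提取", 3, "n", True, True),
    Candidate("的", 10, "u", False, False),
    Candidate("方法", 5, "n", True, False),
    Candidate("提出", 4, "v", False, False),
]
keywords = top_keywords(candidates, stop_words={"的"}, n=2)
```

Under these assumed weights, the longer dictionary term 关键词提取 outranks the more frequent 方法, which illustrates how non-frequency features can surface words that are infrequent but important.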
