To address the high computational complexity and shallow semantic mining of existing keyword extraction algorithms (KEA), this paper proposes a keyword extraction algorithm for Chinese documents based on frequent pattern mining. The algorithm uses an improved FP-Growth algorithm to mine word co-occurrence information and filter out noisy words, applies a semantic similarity algorithm to eliminate synonyms, and simplifies the features of candidate words, reducing storage space and computation while maintaining high precision and recall. Experimental results show that the algorithm achieves an average F value of 59.7% on the corpus, higher than several classical algorithms, and that the support threshold is the most important influencing factor.
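The role of the support threshold in co-occurrence mining can be illustrated with a minimal sketch. This is not the improved FP-Growth algorithm described above: it counts word pairs by brute force, treats each sentence as one transaction, and assumes the text has already been segmented into words. The function names `frequent_cooccurrences` and `candidate_keywords` and the parameter `min_support` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_cooccurrences(sentences, min_support):
    """Return word pairs that co-occur in at least `min_support` sentences.

    Assumption: each sentence is a list of already-segmented words and
    acts as one transaction. Pairs are counted by brute force, not with
    an FP-tree.
    """
    pair_counts = Counter()
    for words in sentences:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(words)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def candidate_keywords(sentences, min_support):
    """Keep only words that take part in some frequent co-occurrence.

    Words that never reach the support threshold are treated as noise.
    """
    frequent = frequent_cooccurrences(sentences, min_support)
    return {word for pair in frequent for word in pair}


# Toy, hand-made input (illustrative only, not from the paper's corpus).
sentences = [
    ["keyword", "extraction", "algorithm"],
    ["keyword", "extraction", "pattern"],
    ["pattern", "mining", "algorithm"],
    ["keyword", "extraction", "mining"],
]

print(candidate_keywords(sentences, min_support=2))
print(candidate_keywords(sentences, min_support=1))
```

On this toy input, raising `min_support` from 1 to 2 shrinks the candidate set from all five words to the single pair that co-occurs repeatedly, which gives a rough sense of why the threshold can dominate the results.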