Feature extraction underlies technologies such as information retrieval, text classification, text clustering, and automatic summarization. Traditional feature extraction methods cannot evaluate candidate feature words comprehensively and effectively. To address this shortcoming, this paper proposes a Chinese web page feature extraction method based on comprehensive heuristics optimized by a genetic algorithm (GA). The method evaluates candidate features using several heuristics, namely word frequency, word correlation, part of speech (POS), and position, and uses the GA to optimize the weight parameter of each heuristic. Comparative experiments on different test sets show that, compared with traditional methods, the proposed method effectively avoids the bias that traditional feature extraction methods introduce and obtains a representative feature set, which gives it a certain practical value.
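The idea of weighting several heuristics and tuning the weights with a GA can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the scoring formula, the fitness function, or the GA operators, so the weighted sum, the top-k overlap fitness, truncation selection, one-point crossover, and Gaussian mutation used here are all assumptions, and the names `score`, `top_k`, `fitness`, `optimize_weights`, and the toy data are hypothetical.

```python
import random

# The four heuristics named in the abstract; each candidate term is assumed
# to carry a precomputed score in [0, 1] for every heuristic.
HEURISTICS = ("frequency", "correlation", "pos", "position")


def score(features, weights):
    """Combined score of one candidate term (assumed form: weighted sum)."""
    return sum(w * features[h] for w, h in zip(weights, HEURISTICS))


def top_k(candidates, weights, k):
    """Return the k candidate terms with the highest combined score."""
    ranked = sorted(candidates, key=lambda t: score(candidates[t], weights), reverse=True)
    return set(ranked[:k])


def fitness(weights, candidates, gold, k):
    """Assumed fitness: fraction of the top-k terms found in a labelled feature set."""
    return len(top_k(candidates, weights, k) & gold) / k


def normalize(weights):
    """Rescale weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def optimize_weights(candidates, gold, k, pop_size=20, generations=30, seed=0):
    """Search for heuristic weights with a simple generational GA."""
    rng = random.Random(seed)
    population = [normalize([rng.random() + 1e-9 for _ in HEURISTICS])
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged.
        population.sort(key=lambda w: fitness(w, candidates, gold, k), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(HEURISTICS))  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(child))            # mutate a single weight
            child[i] = max(1e-9, child[i] + rng.gauss(0, 0.1))
            children.append(normalize(child))
        population = parents + children
    return max(population, key=lambda w: fitness(w, candidates, gold, k))


# Toy data with invented heuristic scores, for illustration only.
toy_candidates = {
    "term_a": {"frequency": 0.9, "correlation": 0.2, "pos": 0.1, "position": 0.1},
    "term_b": {"frequency": 0.3, "correlation": 0.8, "pos": 0.9, "position": 0.7},
    "term_c": {"frequency": 0.2, "correlation": 0.7, "pos": 0.8, "position": 0.9},
    "term_d": {"frequency": 0.8, "correlation": 0.1, "pos": 0.2, "position": 0.2},
    "term_e": {"frequency": 0.4, "correlation": 0.6, "pos": 0.7, "position": 0.8},
    "term_f": {"frequency": 0.7, "correlation": 0.3, "pos": 0.3, "position": 0.1},
}
toy_gold = {"term_b", "term_c", "term_e"}
```

Calling `optimize_weights(toy_candidates, toy_gold, k=3)` returns four positive weights that sum to 1, and `top_k(toy_candidates, weights, 3)` then gives the selected feature set. In a realistic setting the fitness would be measured on held-out labelled pages rather than on the data used for the search.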