Traditional text clustering lacks semantic information, represents texts with high-dimensional, sparse feature vectors, and ignores the particular characteristics of Web text. To address these problems, this paper proposes a clustering method for Chinese Web text. Working in a concept space built on HowNet, the method filters out all words other than nouns, analyzes the semantics of the important words in the text, and performs feature-set clustering on the tag feature set and the body-text feature set. It then uses an improved TF-IDF algorithm to select features from these two sets, and finally represents the text as the union of the selected tag features and body-text features. This reduces the dimensionality of the features and represents the text efficiently. Experimental results demonstrate the effectiveness of the method.
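The final selection-and-union step can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the abstract does not specify the improved TF-IDF weighting, so a standard smoothed TF-IDF stands in for it; noun filtering and the HowNet concept mapping are assumed to have been applied upstream; the feature-set clustering step is omitted; and the names `tf_idf`, `top_k`, and `represent` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each term of one document with a standard smoothed TF-IDF.

    Assumption: this is a stand-in for the paper's improved TF-IDF,
    whose formula is not given in the abstract.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / total) * idf
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def represent(tag_tokens, body_tokens, tag_corpus, body_corpus, k_tag, k_body):
    """Represent a text as the union of its selected tag and body features."""
    tag_features = top_k(tf_idf(tag_tokens, tag_corpus), k_tag)
    body_features = top_k(tf_idf(body_tokens, body_corpus), k_body)
    return tag_features | body_features


# Toy corpora (hypothetical data): tokens are assumed to be nouns already.
tag_corpus = [["clustering", "web"], ["web", "news"], ["web", "sports"]]
body_corpus = [
    ["text", "clustering", "semantics", "text", "feature"],
    ["news", "text", "report"],
    ["sports", "text", "match"],
]

features = represent(
    tag_corpus[0], body_corpus[0], tag_corpus, body_corpus, k_tag=1, k_body=2
)
print(sorted(features))
```

Selecting from the two sets separately keeps the tag features from being crowded out by the much longer body text, and the union holds at most `k_tag + k_body` features per document, which is where the dimensionality reduction in this sketch comes from.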