Feature selection directly affects both the efficiency and the quality of Uyghur text clustering. Based on the word-formation rules of the Uyghur language, and building on an existing document-frequency-based feature selection algorithm, this paper proposes a new feature extraction algorithm for Uyghur text clustering. The new method takes word stems as the feature items of a text and combines a feature-contribution-based selection method with the original algorithm. A Uyghur text clustering system was implemented in Java and used to run experiments on a manually classified text set. The results show that the new feature extraction algorithm effectively reduces the dimensionality of the text vector space and improves precision, recall, and F-measure to varying degrees.
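
As a rough illustration of the document-frequency step that the proposed method builds on, the following is a minimal Java sketch, not the authors' implementation. It assumes each document has already been tokenized and reduced to stems by some separate Uyghur stemmer, which is not shown. The class name `DfFeatureSelector`, its methods, and the `minDf`/`maxDf` thresholds are hypothetical names introduced here. The feature-contribution-based selection is left out because the abstract does not define it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: document-frequency (DF) feature selection over stems.
final class DfFeatureSelector {

    // Counts, for each stem, the number of documents that contain it at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> stemmedDocs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : stemmedDocs) {
            // A set ensures a stem is counted once per document,
            // no matter how often it occurs inside that document.
            Set<String> unique = new HashSet<>(doc);
            for (String stem : unique) {
                df.merge(stem, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keeps the stems whose document frequency lies in [minDf, maxDf].
    // Both thresholds are assumptions; the abstract gives no concrete values.
    static List<String> selectFeatures(List<List<String>> stemmedDocs, int minDf, int maxDf) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : documentFrequency(stemmedDocs).entrySet()) {
            int count = entry.getValue();
            if (count >= minDf && count <= maxDf) {
                selected.add(entry.getKey());
            }
        }
        return selected; // sorted alphabetically, because TreeMap iterates in key order
    }
}
```

Calling `DfFeatureSelector.selectFeatures(docs, minDf, maxDf)` on a list of stemmed documents returns the retained stems, which would then serve as the dimensions of the text vector space. Raising `minDf` or lowering `maxDf` shrinks that space by discarding very rare or very common stems.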