Journal: International Journal of Innovative Research in Science, Engineering and Technology

PMI Based Clustering Algorithm for Feature Reduction in Text Classification


Abstract

Feature clustering is a feature reduction method that reduces the dimensionality of feature vectors for text classification. This paper proposes an incremental feature clustering approach that uses semantic similarity to cluster the features. Pointwise Mutual Information (PMI) is a widely used word similarity measure that captures the semantic similarity between two words and is an alternative to distributional similarity. Computing PMI requires only simple statistics about the two words: the number of co-occurrences (correlations) between the two concepts within a fixed-size context. Once the words from the preprocessed documents are fed in, clusters are formed and one feature (the head word) is identified for each cluster; these head words are used to index the document. PMI assumes that a word has a single sense, but clustering can be optimized further if word polysemy is considered. Hence PMI may be combined with PMImax, which also estimates the correlation between the closest senses of two words, thereby achieving better feature reduction and execution time than other approaches.
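
The abstract does not give the exact formulas or parameters, so the following is only a minimal sketch of the general idea, not the authors' implementation. It estimates the standard PMI, log2(p(x, y) / (p(x) p(y))), from co-occurrence counts in fixed-size windows, then clusters words in a single incremental pass. The function names `pmi_table` and `incremental_clusters`, the window size, the similarity threshold, and the rule that the most frequent founding word acts as the head word are all assumptions made for illustration. PMImax is not covered.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs, window=5):
    """Estimate PMI for word pairs from co-occurrence in fixed-size windows.

    The window size and counting scheme are illustrative assumptions.
    """
    word_counts = Counter()
    pair_counts = Counter()
    n_windows = 0
    for tokens in docs:
        if not tokens:
            continue
        # Slide a fixed-size window over the token list (one window if short).
        for start in range(max(1, len(tokens) - window + 1)):
            seen = set(tokens[start:start + window])
            n_windows += 1
            word_counts.update(seen)
            pair_counts.update(
                frozenset(p) for p in combinations(sorted(seen), 2)
            )
    pmi = {}
    for pair, n_xy in pair_counts.items():
        x, y = tuple(pair)
        # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), probabilities over windows.
        pmi[pair] = math.log2(n_xy * n_windows / (word_counts[x] * word_counts[y]))
    return pmi, word_counts


def incremental_clusters(words, pmi, word_counts, threshold=0.5):
    """Single-pass clustering: a word joins the first cluster whose head word
    it is similar enough to; otherwise it founds a new cluster.

    Assumption: words are visited from most to least frequent, so the head
    word (element 0 of each cluster) is the most frequent founding word.
    """
    clusters = []
    for w in sorted(words, key=lambda t: (-word_counts[t], t)):
        for cluster in clusters:
            head = cluster[0]
            if pmi.get(frozenset((w, head)), float("-inf")) >= threshold:
                cluster.append(w)
                break
        else:
            clusters.append([w])
    return clusters


if __name__ == "__main__":
    # Toy corpus of already preprocessed (tokenized) documents.
    docs = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["car", "road"]]
    pmi, counts = pmi_table(docs, window=2)
    for cluster in incremental_clusters(counts.keys(), pmi, counts):
        print("head word:", cluster[0], "members:", cluster)
```

In this sketch a document would then be indexed by the head words of the clusters its terms fall into, which is what reduces the feature vector's dimensionality. Pairs that never co-occur are treated as dissimilar, which is one of several reasonable conventions.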