Conference: 2017 International Conference on Computer Systems, Electronics and Control

Chinese Web Short Text Subject Clustering Based on Similarity Upper Approximation

Abstract

In this paper, we propose a Web short text clustering method based on a modified Similarity Upper Approximation algorithm. After the initial text modeling, we reduce the dimension of the text feature-word matrix by singular value decomposition. After clustering is complete, we extract the most frequent words in each text cluster to represent that cluster's subject. The clustering process does not require the number of clusters to be specified in advance, which makes it suitable for Web short text collections that are constantly updated and whose number of clusters cannot be known beforehand. To make the number of clusters more accurate, we extend the original algorithm with cluster merging based on the average similarity between clusters and with outlier detection. Experiments show that the modified algorithm is superior to the K-means algorithm and the hierarchical clustering algorithm in clustering accuracy, and more accurate than the original algorithm in the number of clusters.
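The abstract does not give the algorithm's details, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes the common rough-set formulation in which a document's similarity class holds every document whose cosine similarity reaches a threshold, and upper approximations are expanded until they stop changing. The function names, the thresholds, and the toy pre-tokenized documents are all hypothetical; the SVD reduction and the outlier detection step are omitted for brevity.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def upper_approximation_clusters(vectors, threshold):
    """Expand similarity upper approximations until they stabilize (assumed formulation)."""
    n = len(vectors)
    # Similarity class of document i: itself plus every document at least `threshold` similar.
    sim_class = [
        frozenset(j for j in range(n)
                  if j == i or cosine(vectors[i], vectors[j]) >= threshold)
        for i in range(n)
    ]
    current = list(sim_class)
    while True:
        # Upper approximation of a set: union of the similarity classes of its members.
        nxt = [frozenset().union(*(sim_class[j] for j in s)) for s in current]
        if nxt == current:
            break
        current = nxt
    # Distinct stabilized sets are the clusters; no cluster count is given in advance.
    return sorted(set(current), key=min)

def average_similarity(c1, c2, vectors):
    """Mean pairwise cosine similarity between the members of two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def merge_clusters(clusters, vectors, merge_threshold):
    """Repeatedly merge any two clusters whose average similarity reaches the threshold."""
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if average_similarity(clusters[a], clusters[b], vectors) >= merge_threshold:
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters

def top_words(cluster, vectors, k=2):
    """Most frequent words across a cluster, used as its subject label."""
    total = Counter()
    for i in cluster:
        total.update(vectors[i])
    return [word for word, _ in total.most_common(k)]

# Hypothetical pre-tokenized short texts (word segmentation is assumed done already).
docs = [
    ["phone", "battery", "phone"],
    ["phone", "screen", "phone"],
    ["football", "match", "football"],
    ["football", "goal", "football"],
]
vectors = [Counter(d) for d in docs]
raw = upper_approximation_clusters(vectors, threshold=0.5)
clusters = merge_clusters(raw, vectors, merge_threshold=0.5)
topics = [top_words(c, vectors, k=1) for c in clusters]
print(clusters, topics)
```

On this toy input the two phone texts and the two football texts end up in separate clusters, each labeled by its most frequent word. In practice both thresholds would need tuning against the data.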