An Improved News Text Clustering Algorithm Based on MinHash

Journal: 计算机技术与发展 (Computer Technology and Development)

Abstract

The continuing development of information technology has led to rapid growth in the volume of news text on the Internet, which makes effective clustering of that text important. To meet this need, we propose an improved DBSCAN clustering algorithm based on MinHash. Traditional text clustering based on the vector space model suffers from high data dimensionality, high computational complexity, and heavy resource consumption. To address these problems, the algorithm uses MinHash to reduce the dimensionality of the feature-word sets of all texts, which reduces resource consumption. The Jaccard coefficient is then computed for every pair of entries in the resulting feature matrix, and each result is compared with the DBSCAN neighborhood radius Eps. For each text, the number of neighbors that fall within its Eps-neighborhood is compared with MinPts, the minimum number of points needed to form a cluster; this determines whether the text is a core point and whether a cluster can be formed. Experimental results show that the method performs well on news text clustering and can effectively cluster the varied news text found on the Internet.
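The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The seeded MD5 hash family, the signature length of 64, the use of 1 minus the estimated Jaccard coefficient as the distance, and all function names are assumptions made for the example. The neighbor count includes the point itself, as in standard DBSCAN.

```python
import hashlib
from typing import List, Sequence, Set

NUM_HASHES = 64  # assumed signature length; the abstract does not give one


def _hash(token: str, seed: int) -> int:
    """One member of an assumed hash family: seeded MD5, truncated to 64 bits."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(words: Set[str], num_hashes: int = NUM_HASHES) -> List[int]:
    """Reduce a non-empty feature-word set to a fixed-length MinHash signature."""
    return [min(_hash(w, seed) for w in words) for seed in range(num_hashes)]


def estimated_jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of matching signature positions, which estimates the Jaccard coefficient."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dbscan(signatures: Sequence[Sequence[int]], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN over MinHash signatures; returns a cluster id per text, -1 for noise.

    Assumption: distance = 1 - estimated Jaccard, and j is a neighbor of i
    when that distance is at most eps.
    """
    n = len(signatures)
    neighbors = [
        [j for j in range(n)
         if 1.0 - estimated_jaccard(signatures[i], signatures[j]) <= eps]
        for i in range(n)
    ]
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * n
    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # not a core point; may later become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also a core point, so expand
        cluster += 1
    return labels


if __name__ == "__main__":
    # Toy feature-word sets standing in for preprocessed news texts.
    docs = [
        {"stock", "market", "rally"},
        {"stock", "market", "rally"},
        {"football", "league", "final"},
        {"football", "league", "final"},
        {"weather"},
    ]
    sigs = [minhash_signature(d) for d in docs]
    print(dbscan(sigs, eps=0.5, min_pts=2))
```

In this sketch the pairwise comparison is done on signatures of fixed length rather than on the full feature-word sets, which is where the dimensionality reduction described above would pay off on a large corpus.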
