A Chinese Text Clustering Algorithm Based on Canopy+K-means

张琳; 牟向伟

Abstract

With the growth of the Internet, the volume of electronic text online has increased dramatically, making it difficult to mine the required information from massive data quickly and efficiently. Text clustering is one feasible way to address this problem. K-means, a common text clustering algorithm, has two shortcomings: the number of clusters k and the k initial centers must be specified in advance, and the result is sensitive to those initial centers. To address them, this paper adopts a Canopy+K-means algorithm for Chinese text clustering. The Canopy algorithm first performs a "coarse" clustering of the data to obtain the value of k and the initial cluster centers, and K-means then performs a "fine" clustering from that starting point. To avoid the curse of dimensionality during clustering, Word2vec is used to identify synonyms and near-synonyms, which effectively reduces the dimension of the text feature vectors. Experimental results show that Canopy+K-means achieves higher purity, precision, recall, and F value than traditional K-means.
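The coarse-then-fine pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the thresholds `t1` and `t2`, the use of each canopy's mean as its center, and the toy 2-D points standing in for text feature vectors are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def canopy_centers(points, t1, t2):
    """Coarse pass: one center per canopy; len(result) serves as k.

    t1 is the loose threshold and t2 the tight one (t1 > t2). Both are
    hypothetical tuning parameters, not values taken from the paper.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        # Every remaining point within t1 of the seed joins this canopy.
        members = [seed] + [p for p in remaining if euclidean(p, seed) < t1]
        centers.append(mean_vector(members))  # assumption: mean as center
        # Points within t2 of the seed may not seed another canopy.
        remaining = [p for p in remaining if euclidean(p, seed) >= t2]
    return centers

def kmeans(points, initial_centers, max_iter=100):
    """Fine pass: standard K-means iterations from the given centers."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        # Recompute each center; keep the old one if its cluster is empty.
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean_vector(members) if members else old)
        if updated == centers:
            break
        centers = updated
    return labels, centers

# Toy 2-D vectors standing in for document feature vectors (assumed data).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
initial = canopy_centers(points, t1=3.0, t2=1.5)
labels, centers = kmeans(points, initial)
print(len(initial), labels)
```

On this toy input the Canopy pass yields two centers, so K-means runs with k = 2 without either k or the starting centers being chosen by hand. In practice the outcome depends heavily on `t1` and `t2`, which would need tuning for real text vectors.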

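Bibliographic record

Source: 《图书馆论坛》 (Library Tribune) | 2018, No. 6 | pp. 113-119 | 7 pages
Authors: 张琳; 牟向伟
Affiliation: School of Maritime Economics and Management, Dalian Maritime University (大连海事大学航运经济与管理学院)
Original format: PDF
Language of text: Chinese (chi)
Keywords: K-means; Canopy; text clustering; Word2vec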
