Journal: International Journal of Knowledge Engineering and Data Mining

Clustering news articles using efficient similarity measure and N-grams

Abstract

Rapid progress in information technology and the web has made it easier to store huge amounts of collected textual information, e.g., blogs, news articles, e-mail messages, reviews and forum postings. The growing size of textual datasets, together with their high dimensionality and natural-language content, poses a major challenge and makes it hard to categorise such information efficiently. Document clustering is an automatic, unsupervised machine learning technique that aims to group related items into clusters or subsets. The goal is to create clusters that are internally coherent but substantially different from each other. This paper presents a new document clustering technique that uses N-grams and an efficient similarity measure known as the 'improved sqrt-cosine similarity measure'. Comprehensive experiments are conducted to evaluate the proposed clustering technique and compare it with an existing method. The results of the experiments show that the proposed clustering technique outperforms the existing techniques.
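
The abstract names the two ingredients (N-grams and the improved sqrt-cosine similarity) without defining them. The sketch below illustrates how they might fit together, under stated assumptions: the similarity is taken in the form commonly given in the literature, ISC(x, y) = Σ√(xᵢyᵢ) / (√(Σxᵢ) · √(Σyᵢ)) over non-negative feature vectors, and the features are word-level N-gram counts. Whether the paper uses word or character N-grams, raw counts or weighted values, is not stated in the abstract, and the function names `word_ngrams` and `improved_sqrt_cosine` are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n=2):
    """Count word-level n-grams (assumption: lowercased, whitespace-tokenised)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def improved_sqrt_cosine(x, y):
    """Assumed form: sum_i sqrt(x_i * y_i) / (sqrt(sum_i x_i) * sqrt(sum_i y_i)).

    x and y map features to non-negative counts. Returns 0.0 when either
    document has no features.
    """
    total_x, total_y = sum(x.values()), sum(y.values())
    if total_x == 0 or total_y == 0:
        return 0.0
    # Only features present in both documents contribute to the numerator.
    shared = x.keys() & y.keys()
    numerator = sum(sqrt(x[k] * y[k]) for k in shared)
    return numerator / (sqrt(total_x) * sqrt(total_y))


doc_a = word_ngrams("the cat sat on the mat")
doc_b = word_ngrams("the cat sat on the sofa")
similarity = improved_sqrt_cosine(doc_a, doc_b)  # 4 of 5 bigrams are shared
```

A pairwise similarity of this kind could then feed any clustering algorithm that accepts a similarity or distance matrix; the abstract does not say which algorithm the paper uses.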
