Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 10)

Document Clustering via Dirichlet Process Mixture Model with Feature Selection



Abstract

One essential issue in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters, with the number of clusters determined automatically by the Dirichlet process mixture model; and 2) to identify discriminative words and separate them from irrelevant noise words via the stochastic search variable selection technique. We evaluate the proposed approach on both a synthetic dataset and several realistic document datasets. Comparison with state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
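To illustrate how a Dirichlet process prior can leave the number of clusters open, the following minimal sketch samples cluster labels from a Chinese restaurant process, a standard constructive view of the Dirichlet process. It is not the DPMFS algorithm: it covers only the prior over partitions and omits the document likelihood and the stochastic search variable selection step. The function name `crp_assignments` and the concentration parameter `alpha` are assumptions introduced for illustration.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values make it
    more likely that an item opens a new cluster.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n_items=200, alpha=1.0)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of clusters is an outcome of sampling rather than an input. In a full mixture model, each weight would additionally be multiplied by the likelihood of the document under the corresponding cluster.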
