IEEE International Conference on Data Engineering

A model-based approach for text clustering with outlier detection

Abstract

Text clustering is a challenging problem because text datasets are high-dimensional and large in volume. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbreviated GSDPMM), which does not require the number of clusters to be specified in advance and can cope with the high dimensionality of text data. Our extensive experimental study shows that GSDPMM achieves significantly better performance than three other clustering methods and achieves high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and scales well to huge text datasets. We also propose some novel and effective methods to detect outliers in the dataset and to obtain the representative words of each cluster.
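
To make the idea of collapsed Gibbs sampling for a Dirichlet process mixture of multinomials more concrete, the following is a minimal Python sketch. It is not the authors' implementation: it uses the textbook Chinese-restaurant-process form of the sampler, in which a document joins an existing cluster with weight proportional to the cluster's size times a Dirichlet-multinomial predictive term, or opens a new cluster with weight proportional to a concentration parameter. The names `gibbs_dpmm` and `log_predictive`, the hyperparameters `alpha` and `beta`, and their default values are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def log_predictive(bag, doc_len, word_counts, total, vocab_size, beta):
    """Log Dirichlet-multinomial predictive of a document given a cluster's
    word counts (the document itself must already be excluded)."""
    logp = 0.0
    for word, freq in bag.items():
        base = word_counts.get(word, 0) + beta
        for j in range(freq):
            logp += math.log(base + j)
    denom = total + vocab_size * beta
    for i in range(doc_len):
        logp -= math.log(denom + i)
    return logp


def gibbs_dpmm(docs, alpha=1.0, beta=0.1, iters=20, seed=0):
    """Cluster tokenised documents; the number of clusters is not fixed.

    alpha (concentration) and beta (word smoothing) are assumed
    hyperparameters, not values taken from the paper.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    bags = [Counter(d) for d in docs]
    assign = [None] * len(docs)
    clusters = {}  # id -> {"m": doc count, "n": token count, "words": Counter}
    next_id = 0
    for _ in range(iters + 1):  # the first pass acts as sequential initialisation
        for d, bag in enumerate(bags):
            doc_len = len(docs[d])
            z = assign[d]
            if z is not None:  # remove the document from its current cluster
                c = clusters[z]
                c["m"] -= 1
                c["n"] -= doc_len
                c["words"].subtract(bag)
                if c["m"] == 0:
                    del clusters[z]  # empty clusters disappear
            ids = list(clusters)
            logw = [
                math.log(clusters[k]["m"])
                + log_predictive(bag, doc_len, clusters[k]["words"],
                                 clusters[k]["n"], vocab_size, beta)
                for k in ids
            ]
            # Last option: open a brand-new cluster.
            logw.append(math.log(alpha)
                        + log_predictive(bag, doc_len, {}, 0, vocab_size, beta))
            top = max(logw)
            weights = [math.exp(x - top) for x in logw]  # stable normalisation
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(ids):
                z = next_id
                next_id += 1
                clusters[z] = {"m": 0, "n": 0, "words": Counter()}
            else:
                z = ids[choice]
            c = clusters[z]
            c["m"] += 1
            c["n"] += doc_len
            c["words"].update(bag)
            assign[d] = z
    return assign, clusters


if __name__ == "__main__":
    toy_docs = [
        ["apple", "banana", "apple"],
        ["banana", "apple", "fruit"],
        ["gpu", "kernel", "cuda"],
        ["cuda", "gpu", "thread"],
    ]
    labels, found = gibbs_dpmm(toy_docs, iters=10, seed=1)
    print(labels, len(found))
```

Because each document only touches the counts of its own words, one sweep costs roughly the number of active clusters times the total token count, and the sampler stores only per-cluster counts rather than full document vectors. The cluster labels it returns depend on the random seed and on the assumed hyperparameters.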