A model-based approach for text clustering with outlier detection

Abstract

Text clustering is a challenging problem because text datasets are high-dimensional and large in volume. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbreviated as GSDPMM), which does not require the number of clusters to be specified in advance and can cope with the high dimensionality of text. Our extensive experimental study shows that GSDPMM achieves significantly better performance than three other clustering methods and high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and scales well to huge text datasets. We also propose novel and effective methods to detect outliers in the dataset and to obtain the representative words of each cluster.
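
The abstract does not give the sampling equations, so the following is only a minimal textbook sketch of collapsed Gibbs sampling for a Dirichlet process mixture of multinomials, not the authors' exact GSDPMM update. The function names (`log_predictive`, `gibbs_dpmm`), the parameters `alpha` (concentration) and `beta` (symmetric word prior), and their default values are assumptions made for illustration. Each document is removed from its cluster, scored against every existing cluster and against a fresh empty cluster, and then reassigned by sampling, so the number of clusters is never fixed in advance.

```python
import math
import random
from collections import Counter


def log_predictive(doc_counts, doc_len, cluster_words, cluster_len, vocab_size, beta):
    """Log Dirichlet-multinomial predictive of a document given a cluster's
    word counts (constant factors that do not depend on the cluster are dropped)."""
    logp = 0.0
    for word, freq in doc_counts.items():
        base = cluster_words.get(word, 0) + beta
        for j in range(freq):
            logp += math.log(base + j)
    for i in range(doc_len):
        logp -= math.log(cluster_len + vocab_size * beta + i)
    return logp


def gibbs_dpmm(docs, vocab_size, alpha=1.0, beta=0.1, iters=20, seed=0):
    """Cluster tokenized documents; returns (labels, clusters).

    clusters maps a cluster id to {"m": document count, "n": token count,
    "w": Counter of word counts}. alpha and beta are assumed hyperparameters.
    """
    rng = random.Random(seed)
    counts = [Counter(doc) for doc in docs]
    labels = [None] * len(docs)
    clusters = {}
    next_id = 0

    for _ in range(iters):
        for d, doc in enumerate(docs):
            # Take the document out of its current cluster, if it has one.
            old = labels[d]
            if old is not None:
                c = clusters[old]
                c["m"] -= 1
                c["n"] -= len(doc)
                c["w"].subtract(counts[d])
                if c["m"] == 0:
                    del clusters[old]

            # Score every existing cluster, then a brand-new empty cluster.
            ids = list(clusters)
            log_w = [
                math.log(clusters[k]["m"])
                + log_predictive(counts[d], len(doc), clusters[k]["w"],
                                 clusters[k]["n"], vocab_size, beta)
                for k in ids
            ]
            log_w.append(
                math.log(alpha)
                + log_predictive(counts[d], len(doc), {}, 0, vocab_size, beta)
            )

            # Sample a cluster from the normalized weights (shifted for stability).
            top = max(log_w)
            weights = [math.exp(x - top) for x in log_w]
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(ids):
                new = next_id
                next_id += 1
                clusters[new] = {"m": 0, "n": 0, "w": Counter()}
            else:
                new = ids[choice]

            # Add the document to the chosen cluster.
            c = clusters[new]
            c["m"] += 1
            c["n"] += len(doc)
            c["w"].update(counts[d])
            labels[d] = new

    return labels, clusters
```

Calling `gibbs_dpmm(docs, vocab_size)` on a list of token lists returns one cluster label per document together with per-cluster counts. Only sparse word counts per cluster are stored, which is in the spirit of the low space complexity the abstract reports, though this sketch makes no claim to match the paper's measured performance.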

Bibliographic details

Source
IEEE International Conference on Data Engineering | 2016 | pp. 625-636 | 12 pages
Authors
Jianhua Yin; Jianyong Wang
Original format: PDF

