Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model



Abstract

Document clustering for short texts has received considerable interest. Traditional document clustering approaches are designed for long documents and perform poorly on short texts because of their sparse representation. To better understand short texts, we observe that words appearing in long documents can enrich the context of short texts and improve their clustering performance. In this paper, we propose a novel model, namely DDMAfs, which (1) improves the clustering performance of short texts by sharing structural knowledge of long documents with short texts; (2) automatically identifies the number of clusters; and (3) separates discriminative words from irrelevant words in long documents to obtain high-quality structural knowledge. Our experiments indicate that the DDMAfs model performs well on a synthetic dataset and on real datasets. Comparisons between the DDMAfs model and state-of-the-art short text clustering approaches show that the DDMAfs model is effective.


Bibliographic record

Source
Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data | 2017 | pp. 626-641 | 16 pages
Authors
Yingying Yan; Ruizhang Huang; Can Ma; Liyang Xu; Zhiyuan Ding; Rui Wang; Ting Huang; Bowei Liu;
Original format: PDF
Keywords
Short text clustering; Dirichlet multinomial allocation; Gibbs sampling algorithm;

Date indexed: 2022-08-26 13:48:45

Similar documents

1. Social-Child-Case Document Clustering based on Topic Modeling using Latent Dirichlet Allocation [J]. Nur Annisa Tresnasari, Teguh Bharata Adji, Adhistya Erna Permanasari. Indonesian Journal of Computing and Cybernetics Systems. 2020, No. 2
2. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering [J]. Jianhua Yin, Jianyong Wang. SIGKDD Explorations. 2014, No. CDaROM
3. DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering [J]. Lakshmi R., Baskar S. Journal of Information Science. 2019, No. 6
4. Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model [C]. Yingying Yan, Ruizhang Huang, Can Ma. Asia Pacific Web and Web-Age Information Management. 2017
5. A comparative study on ontology generation and text clustering using VSM, LSI, and document ontology models [D]. Taylor, William P., II. 2007
6. Swarm Intelligence Algorithms in Text Document Clustering with Various Benchmarks [O]. Suganya Selvaraj, Eunmi Choi. 2021
7. Hierarchical Dirichlet Multinomial Allocation Model for Multi-Source Document Clustering [O]. Ruizhang Huang, Weijia Xu, Yongbin Qin. 2020
