Published in: Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data

Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model

Abstract

Document clustering for short texts has received considerable interest. Traditional document clustering approaches are designed for long documents and perform poorly on short texts because of their sparse representation. To better understand short texts, we observe that words that appear in long documents can enrich the context of short texts and improve their clustering performance. In this paper, we propose a novel model, DDMAfs, which (1) improves the clustering performance of short texts by transferring structural knowledge from long documents to short texts; (2) automatically identifies the number of clusters; and (3) separates discriminative words from irrelevant words in long documents to obtain high-quality structural knowledge. Our experiments indicate that the DDMAfs model performs well on a synthetic dataset and on real datasets. Comparisons between the DDMAfs model and state-of-the-art short text clustering approaches show that the DDMAfs model is effective.
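
To make point (2) of the abstract more concrete, the following is a minimal sketch of a generic Dirichlet multinomial allocation (DMA) mixture fitted by collapsed Gibbs sampling. It is not the DDMAfs model: it has no knowledge transfer from long documents and no separation of discriminative from irrelevant words. The function name `dma_gibbs`, the hyperparameters `alpha` and `beta`, the upper bound `max_clusters`, and the integer word-id input format are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def dma_gibbs(docs, vocab_size, max_clusters=10, alpha=1.0, beta=0.1,
              iters=50, seed=0):
    """Collapsed Gibbs sampler for a generic DMA mixture (illustrative only).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns one cluster label per document.
    """
    rng = random.Random(seed)
    K = max_clusters
    z = [rng.randrange(K) for _ in docs]          # random initial labels
    doc_count = [0] * K                           # documents per cluster
    word_count = [Counter() for _ in range(K)]    # word counts per cluster
    total = [0] * K                               # total words per cluster
    for d, k in zip(docs, z):
        doc_count[k] += 1
        word_count[k].update(d)
        total[k] += len(d)

    for _ in range(iters):
        for i, d in enumerate(docs):
            # Remove document i from its current cluster.
            k_old = z[i]
            doc_count[k_old] -= 1
            word_count[k_old].subtract(d)
            total[k_old] -= len(d)

            # Unnormalised log-probability of each cluster for document i.
            log_p = []
            for k in range(K):
                # Prior term: symmetric Dirichlet(alpha / K) on mixture weights.
                lp = math.log(doc_count[k] + alpha / K)
                # Likelihood term: Dirichlet-multinomial predictive, word by word.
                seen = Counter()
                for j, w in enumerate(d):
                    lp += math.log(word_count[k][w] + seen[w] + beta)
                    lp -= math.log(total[k] + j + vocab_size * beta)
                    seen[w] += 1
                log_p.append(lp)

            # Sample a new cluster; subtract the max for numerical stability.
            m = max(log_p)
            weights = [math.exp(lp - m) for lp in log_p]
            k_new = rng.choices(range(K), weights=weights)[0]

            # Add document i to the sampled cluster.
            z[i] = k_new
            doc_count[k_new] += 1
            word_count[k_new].update(d)
            total[k_new] += len(d)
    return z
```

In this sketch, `max_clusters` is only an upper bound. After sampling, `len(set(labels))` gives the number of clusters that are actually occupied, which is one common way such models arrive at the number of clusters without it being fixed in advance.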