
A Dirichlet process biterm-based mixture model for short text stream clustering



Abstract

Short text stream clustering has become an important problem in mining textual data from diverse social media platforms (e.g., Twitter). However, most existing clustering methods (e.g., LDA and PLSA) assume a static corpus of long texts, and little attention has been given to short text streams. Clustering short texts is more challenging than clustering long texts because their word co-occurrence patterns easily suffer from sparsity. In this paper, we propose a Dirichlet process biterm-based mixture model (DP-BMM), which can deal with both the topic drift problem and the sparsity problem in short text stream clustering. The major advantages of DP-BMM are that (1) it explicitly exploits the word pairs constructed from each document to enhance the word co-occurrence pattern in short texts, and (2) it deals with the topic drift problem of short text streams naturally. Moreover, we propose an improved variant of DP-BMM with a forgetting property, called DP-BMM-FP, which can efficiently delete the biterms of outdated documents by deleting the clusters of outdated batches. To perform inference, we adopt an online Gibbs sampling method for parameter estimation. Our extensive experimental results on real-world datasets show that DP-BMM and DP-BMM-FP achieve better performance than state-of-the-art methods in terms of the NMI metric.
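To make the word-pair idea concrete, the following is a minimal sketch of how biterms might be built from tokenized short documents. It assumes a biterm is any unordered pair of tokens that co-occur in the same document; the abstract does not specify the exact construction, and the function names `extract_biterms` and `biterm_counts` are hypothetical, not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one tokenized document.

    Assumption: all pairs within a document count as co-occurring, which is
    reasonable for short texts but is not confirmed by the abstract.
    """
    # Sort each pair so ("text", "stream") and ("stream", "text") coincide.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def biterm_counts(documents):
    """Aggregate biterm frequencies over a batch of tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(extract_biterms(tokens))
    return counts


# Two toy documents standing in for one batch of a short text stream.
docs = [
    ["short", "text", "stream"],
    ["text", "stream", "clustering"],
]
counts = biterm_counts(docs)
print(counts[("stream", "text")])
```

A document with n tokens yields n(n-1)/2 biterms under this assumption, so even a three-word text contributes three co-occurrence observations, which illustrates how word pairs can densify a sparse co-occurrence pattern.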
