Source: Information Processing & Management

An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Abstract

Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user-generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data because such text is notoriously short and noisy, and results are often not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term frequency-inverse document frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words-based approach using tf-idf weights combined with embedding distance measures.
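
To make the tf-idf weighting, the extrinsic evaluation and the top-words interpretation more concrete, the following is a minimal standard-library sketch and not the authors' pipeline. The function names (`tfidf`, `top_words`, `nmi`) and the toy documents are hypothetical. The idf formula is the common smoothed variant, which is an assumption, and normalised mutual information stands in as one example of an extrinsic measure, since the abstract does not name the measures it recommends. The top-words step here uses tf-idf weights only and omits the embedding distance component mentioned in the abstract.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumption: smoothed idf, log((1 + n) / (1 + df)) + 1.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def top_words(vectors, labels, k=3):
    """Rank the terms of each cluster by their summed tf-idf weight."""
    totals = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        totals[label].update(vec)  # adds the weights term by term
    return {label: [t for t, _ in c.most_common(k)] for label, c in totals.items()}


def nmi(truth, pred):
    """Normalised mutual information (arithmetic-mean normalisation)."""
    n = len(truth)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    count_t, count_p = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(
        (c / n) * math.log(c * n / (count_t[a] * count_p[b]))
        for (a, b), c in joint.items()
    )
    denom = (entropy(truth) + entropy(pred)) / 2
    return mi / denom if denom > 0 else 1.0


# Hypothetical toy data: four short, already tokenised "posts".
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "trade", "bank"],
]
gold = [0, 0, 1, 1]        # ground-truth topic labels
predicted = [1, 1, 0, 0]   # cluster ids from some clustering method

vectors = tfidf(docs)
print(top_words(vectors, predicted, k=2))
print(nmi(gold, predicted))
```

Because extrinsic measures such as NMI compare partitions rather than label names, the swapped cluster ids in `predicted` do not lower the score. This is why such measures can be applied uniformly to the output of clustering methods and of a topic model.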