An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Abstract

Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user-generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data because such text is notoriously short and noisy, and results are often not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words-based approach using tf-idf weights combined with embedding distance measures.
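As a rough illustration of the kind of pipeline the abstract describes (a feature representation, a clustering step, and an extrinsic evaluation against known labels), here is a minimal Python sketch. It is not the authors' code. The toy documents, the nearest-seed assignment used as a stand-in for a real clustering method, the smoothed idf variant, and the choice of normalised mutual information (NMI) as the extrinsic measure are all assumptions made for this example; the abstract does not say which measures the paper recommends.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # One common smoothed idf variant (an assumption, not from the paper).
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def nmi(labels_true, labels_pred):
    """Normalised mutual information (arithmetic-mean normalisation)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum((c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
             for (a, b), c in joint.items())
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())
    denom = (h_true + h_pred) / 2
    return mi / denom if denom > 0 else 1.0


# Hypothetical toy corpus with ground-truth topic labels.
docs = [
    "cat dog pet food",
    "dog pet vet cat",
    "stock market trade price",
    "market price stock bank",
]
truth = [0, 0, 1, 1]

vectors = tfidf_vectors(docs)

# Toy stand-in for a clustering method: assign each document to the
# most similar of two hand-picked seed documents.
seeds = [0, 2]
pred = [max(range(len(seeds)), key=lambda k: cosine(vec, vectors[seeds[k]]))
        for vec in vectors]

score = nmi(truth, pred)
```

In a real benchmark the seed assignment would be replaced by an actual clustering algorithm and the tf-idf vectors by whichever feature representation is under test; the evaluation step stays the same because an extrinsic measure only compares predicted labels with ground-truth labels.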

Bibliographic details

Source: Information Processing & Management, 2020, Issue 2, pp. 102034.1-102034.21 (21 pages)
Author affiliation: Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, 15 Broadway, Ultimo NSW 2007, Australia
Original format: PDF
Language: English
Keywords: Document clustering; Topic modelling; Topic discovery; Embedding models; Online social networks
Date added to database: 2022-08-18 05:22:49
