Scalable Topic Detection Approaches from Twitter Streams

Abstract

Real-time topic detection in Twitter streams is an important task: it helps discover natural disasters in real time from users' posts, and it helps political parties and companies understand users' opinions and needs. In 2014, Twitter was reported to have more than 288 million active users posting around 500 million tweets daily. Detecting topics from Twitter streams in real time is therefore a challenging task that needs scalable and efficient techniques to handle this volume of data. In this work, we scale an Exemplar-based technique that detects topics from Twitter streams, where each detected topic is represented by one tweet (i.e., an exemplar). Representing the detected topics by exemplar tweets makes them easier to interpret than representing them by uncorrelated terms, as other topic detection algorithms do. The approach is implemented using Apache Giraph and is extended here to efficiently support sliding windows. Experimental results on four datasets show that the optimized Giraph implementation achieves a speedup of up to nineteen times over the native implementation while maintaining good quality of the detected topics. In addition, the Giraph Exemplar-based approach achieves the best topic recall and term precision against K-means, Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA), while maintaining good term recall and running time. The approach is also deployed for detecting topics from real-time Twitter streams, and its scalability is demonstrated.

Moreover, another clustering technique, called Local Variance-based Clustering (LVC), is proposed in this thesis for detecting topics from Twitter streams. LVC defines the densities of data points based on their similarities. The proposed local variance measure is calculated from the variance of the data points' similarity histogram and is shown to distinguish well between core, border, connecting and outlier points. Experimental results show that LVC outperforms spectral clustering and affinity propagation in clustering quality on the control charts, Ecoli and images datasets, while maintaining a good running time. In addition, results show that LVC detects topics from Twitter with 15% higher topic recall and 3% higher term precision than DBSCAN.
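The abstract does not spell out how the local variance measure is computed, so the following is only a minimal sketch of one possible reading, not the thesis's actual definition. It assumes cosine similarity between points, a fixed number of equal-width histogram bins, and that the "variance of the similarity histogram" means the variance of the normalized bin frequencies. The function name `similarity_histogram_variance` and the `bins` parameter are hypothetical.

```python
import numpy as np


def similarity_histogram_variance(points, bins=10):
    """Return one score per point: the variance of the normalized
    histogram of its cosine similarities to all other points.

    Illustrative assumption only; the thesis may define the
    local variance measure differently.
    """
    X = np.asarray(points, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0            # avoid dividing by zero for all-zero rows
    unit = X / norms
    # Pairwise cosine similarities; clip to absorb floating-point overshoot.
    sims = np.clip(unit @ unit.T, -1.0, 1.0)

    n = X.shape[0]
    scores = np.empty(n)
    for i in range(n):
        others = np.delete(sims[i], i)   # drop the self-similarity entry
        counts, _ = np.histogram(others, bins=bins, range=(-1.0, 1.0))
        freqs = counts / counts.sum()    # normalize counts to frequencies
        scores[i] = freqs.var()          # variance across the histogram bins
    return scores
```

Under this reading, a point whose similarities pile up in a few bins gets a high score, while a point whose similarities are spread evenly across bins gets a score near zero. How such scores would map onto the core, border, connecting and outlier categories is not stated in the abstract and is left open here. The function needs at least two input points.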

Bibliographic details

  • Author

    Ibrahim Rania

  • Year 2016
  • Original format PDF
  • Language en
