Journal: SIGKDD Explorations

On-line Relevant Anomaly Detection in the Twitter Stream: An Efficient Bursty Keyword Detection Model


Abstract

On-line social networks have become a massive communication and information channel for users world-wide. In particular, the microblogging platform Twitter is characterized by short-text message exchanges at extremely high rates. In this type of scenario, the detection of emerging topics in text streams becomes an important research area, essential for identifying relevant new conversation topics, such as breaking news and trends. Although emerging topic detection in text is a well-established research area, its application to large volumes of streaming text data is quite novel. This makes scalability, efficiency and speed the key aspects of any emerging topic detection algorithm in this type of environment. Our research addresses this problem by focusing on detecting significant and unusual bursts in keyword arrival rates, that is, bursty keywords. We propose a scalable and fast on-line method that uses normalized individual frequency signals per term and a windowing variation technique. This method reports keyword bursts, which can be composed of single or multiple terms, ranked according to their importance. The average complexity of our method is O(n log n), where n is the number of messages in the time window. This complexity allows our approach to scale to large streaming datasets. If bursts are only detected and not ranked, the algorithm retains linear complexity O(n), making it the fastest in comparison to the current state of the art. We validate our approach by comparing our performance to similar systems using the TREC Tweet 2011 Challenge tweets, obtaining 91% of matches with LDA, an off-line gold standard used in similar evaluations. In addition, we study Twitter messages related to the Super Bowl football events in 2011 and 2013.
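The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: compute a normalized frequency per term in each time window, flag terms whose share rises sharply between consecutive windows, and optionally rank them. The function names, the whitespace tokenization, and the `min_increase` threshold are assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def relative_frequencies(messages):
    """Map each term to its share of all term occurrences in one window."""
    counts = Counter(term for msg in messages for term in msg.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def detect_bursts(previous_window, current_window, min_increase=0.05):
    """Return {term: increase} for terms whose normalized frequency rose
    by at least min_increase (a hypothetical threshold) between windows.

    This is a single pass over the terms; no sorting is involved.
    """
    prev = relative_frequencies(previous_window)
    curr = relative_frequencies(current_window)
    bursts = {}
    for term, freq in curr.items():
        increase = freq - prev.get(term, 0.0)
        if increase >= min_increase:
            bursts[term] = increase
    return bursts


def rank_bursts(bursts):
    """Order bursty terms by descending increase; ties break alphabetically.

    Sorting is the only step here that is not linear.
    """
    return sorted(bursts, key=lambda term: (-bursts[term], term))


previous = ["the game starts soon", "watching the game tonight"]
current = ["touchdown touchdown what a game", "the touchdown was amazing"]
print(rank_bursts(detect_bursts(previous, current, min_increase=0.2)))
# ['touchdown']
```

Keeping detection and ranking as separate functions mirrors the distinction the abstract draws: detection alone needs one pass over the window, while ranking adds a sort.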
