Journal: Journal of Information Recording

Corpus-based Topic Derivation and Timestamp-based Popular Hashtag Prediction in Twitter

Abstract

With the use of the Internet, mobile platforms, online commerce, and social media services, the footprints of human behavior can be easily recorded in the digital world, generating data on an extremely large scale. Twitter, as a big data social network, has become one of the most important sources for capturing up-to-date events happening in the world. Deriving topics from Twitter is important for various applications, such as situation awareness, market analysis, content filtering, and recommendations. However, topic derivation with high purity is hard to achieve in Twitter because tweets are limited to 140 characters, and previous work on topic derivation in Twitter suffers from low purity. In this paper, we propose a corpus-based topic derivation (CTD) approach that combines a Twitter corpus with LF-LDA, a text processing model, to identify topics and clusters of similar hashtags. We use asymmetric topic LF-LDA to obtain better topic purity. Compared to intJNMF, a representative related work, the purity (F-measure) of our proposed CTD increases from 5.26% (27.81%) to 11.32% (34.28%) for 20 to 100 topics. We also propose a timestamp-based popular hashtag prediction (TPHP) approach that uses the timestamps in tweets to create trending hashtag lists (THLs), which are lists of hashtags used by many users. We use the edit distance to measure the difference between consecutive THLs. This difference is then used to calculate volatility, which indicates how people react to real-world events. Compared to Hybrid+, a representative related work, the mean average precision (MAP) of our TPHP increases by 19.45% (week-day), 15.08% (week-week), and 16.95% (month-week).
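The THL comparison step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the toy tweets, the 24-hour windows, ranking hashtags by number of distinct users, the cut-off `k`, and the helper names `trending_hashtag_list` and `edit_distance` are all hypothetical. The abstract does not give the volatility formula, so the sketch stops at the series of edit distances between consecutive lists.

```python
def trending_hashtag_list(tweets, start, end, k=3):
    """Top-k hashtags by distinct-user count for tweets with start <= timestamp < end.

    Assumption: "used by many users" is approximated by counting distinct users.
    """
    users_per_tag = {}
    for user, timestamp, hashtags in tweets:
        if start <= timestamp < end:
            for tag in set(hashtags):
                users_per_tag.setdefault(tag, set()).add(user)
    # Sort by descending user count, breaking ties alphabetically for determinism.
    ranked = sorted(users_per_tag, key=lambda t: (-len(users_per_tag[t]), t))
    return ranked[:k]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of hashtags)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Toy data (hypothetical): (user, timestamp in hours, hashtags).
tweets = [
    ("u1", 1, ["#a", "#b"]),
    ("u2", 2, ["#a"]),
    ("u3", 3, ["#b", "#c"]),
    ("u1", 25, ["#a"]),
    ("u2", 26, ["#d"]),
    ("u3", 27, ["#d", "#a"]),
    ("u4", 50, ["#a"]),
    ("u5", 51, ["#a", "#d"]),
    ("u6", 52, ["#d", "#f"]),
]

# One THL per 24-hour window, then the distance between each consecutive pair.
thls = [trending_hashtag_list(tweets, s, s + 24) for s in (0, 24, 48)]
distances = [edit_distance(p, q) for p, q in zip(thls, thls[1:])]
print(thls)
print(distances)
```

Treating each THL as a sequence of hashtag tokens, rather than as a string of characters, means the distance counts whole hashtags that entered, left, or changed rank between windows.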
