IEEE International Conference on Data Mining Workshops

Topic Extraction Method from Millions of Tweets Based on Fast Feature Selection Technique CWC


Abstract

Social media offers a wealth of insight into how significant topics such as the Great East Japan Earthquake, the Arab Spring, and the Boston Marathon bombing affect individuals. The scale of the available data, however, can be intimidating: during the Great East Japan Earthquake, over 8 million tweets were sent each day from Japan alone. Conventional word-vector-based topic detection techniques for social media, which use Latent Semantic Analysis, Latent Dirichlet Allocation (LDA), or graph community detection, often cannot scale to such a volume of data because of their space and time complexity. To alleviate this problem, we previously proposed an efficient topic extraction method that leverages our original fast feature selection algorithm, CWC, which vastly reduces the number of features to track. Starting from word count vectors of authors and words for each time slot (in our case, every 30 minutes), we apply a matrix decomposition technique to identify clusters within each time slot, and then adapt CWC to extract discriminative words from each cluster. This makes it possible to detect topics in high-dimensional datasets. In this paper, to demonstrate the method's effectiveness, we extract topics from a dataset of over two hundred million tweets sent following the Great East Japan Earthquake and compare them with the topics extracted by LDA, currently the most popular topic extraction method. With CWC, we can identify topics in this dataset with great speed and accuracy.
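The shape of the pipeline described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes tweets arrive as hypothetical `(epoch_seconds, author, text)` tuples and uses whitespace tokenization. The matrix decomposition step is not reproduced, so the cluster labels are passed in as an assumed input. CWC itself is replaced by a plain frequency-difference score, which serves only as a stand-in for discriminative word selection. The function names `author_word_counts` and `discriminative_words` are invented for this illustration.

```python
from collections import Counter, defaultdict

SLOT_SECONDS = 30 * 60  # 30-minute time slots, as stated in the abstract


def author_word_counts(tweets):
    """Group (epoch_seconds, author, text) tweets into slot -> author -> word counts."""
    slots = defaultdict(lambda: defaultdict(Counter))
    for epoch_seconds, author, text in tweets:
        slot = epoch_seconds // SLOT_SECONDS
        slots[slot][author].update(text.lower().split())
    return slots


def discriminative_words(author_counts, labels, top_n=2):
    """Stand-in for CWC (NOT the real algorithm): rank each cluster's words by
    in-cluster relative frequency minus out-of-cluster relative frequency."""
    per_cluster = defaultdict(Counter)
    for author, counts in author_counts.items():
        per_cluster[labels[author]].update(counts)

    total = Counter()
    for counts in per_cluster.values():
        total.update(counts)

    result = {}
    for cluster, inside in per_cluster.items():
        outside = total - inside  # word counts from all other clusters
        n_in = sum(inside.values())
        n_out = sum(outside.values())

        def score(word):
            out_freq = outside[word] / n_out if n_out else 0.0
            return inside[word] / n_in - out_freq

        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(inside, key=lambda w: (-score(w), w))
        result[cluster] = ranked[:top_n]
    return result


# Toy data; the cluster labels are assumed to come from a separate clustering step.
tweets = [
    (0, "a1", "earthquake tokyo shaking"),
    (60, "a2", "earthquake tokyo trains stopped"),
    (120, "b1", "tsunami warning coast"),
    (180, "b2", "tsunami warning evacuate"),
    (1800, "a1", "aftershock tokyo"),
]
labels = {"a1": 0, "a2": 0, "b1": 1, "b2": 1}

slots = author_word_counts(tweets)
topics = discriminative_words(slots[0], labels)
print(topics)
```

Because each time slot is processed independently, the per-slot work in this sketch depends on the vocabulary and authors seen in one 30-minute window, not on the whole corpus.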
