Topic Extraction Method from Millions of Tweets Based on Fast Feature Selection Technique CWC



Abstract

Social media offers a wealth of insight into how significant events such as the Great East Japan Earthquake, the Arab Spring, and the Boston Bombing affect individuals. The scale of available data, however, can be intimidating: during the Great East Japan Earthquake, over 8 million tweets were sent each day from Japan alone. Conventional word-vector-based topic-detection techniques for social media, which use Latent Semantic Analysis, Latent Dirichlet Allocation (LDA), or graph community detection, often cannot scale to such a large volume of data because of their space and time complexity. To alleviate this problem, we previously proposed an efficient topic extraction method that leverages our original fast feature selection algorithm, CWC, which vastly reduces the number of features to track. Starting from word count vectors of authors and words for each time slot (in our case, every 30 minutes), we form clusters within each time slot using a matrix decomposition technique, then adapt CWC to extract discriminative words from each cluster. This method makes it possible to detect topics in high-dimensional datasets. In this paper, to demonstrate our method's effectiveness, we extract topics from a dataset of over two hundred million tweets sent following the Great East Japan Earthquake and compare them with the results extracted by LDA, currently the most popular topic extraction method. With CWC, we can identify topics from this dataset with great speed and accuracy.
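As a rough illustration of the first step the abstract describes (word count vectors of authors and words per 30-minute time slot), the sketch below groups tweets into slots and counts words per author. It is not the authors' implementation: the `(timestamp, author, text)` tuple layout, the function names `slot_start` and `count_vectors`, the whitespace tokenizer, and the toy tweets are all assumptions made for illustration. The clustering by matrix decomposition and the CWC feature selection steps are not shown.

```python
from collections import Counter, defaultdict
from datetime import datetime

SLOT_MINUTES = 30  # slot width stated in the abstract


def slot_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 30-minute slot."""
    minute = (ts.minute // SLOT_MINUTES) * SLOT_MINUTES
    return ts.replace(minute=minute, second=0, microsecond=0)


def count_vectors(tweets):
    """Return a mapping: slot start -> author -> Counter of word counts.

    `tweets` is an iterable of (timestamp, author, text) tuples; this
    layout is hypothetical. Whitespace splitting is a placeholder
    tokenizer; Japanese text would need a morphological analyzer.
    """
    slots = defaultdict(lambda: defaultdict(Counter))
    for ts, author, text in tweets:
        slots[slot_start(ts)][author].update(text.lower().split())
    return slots


# Toy data invented for this example, not taken from the paper's dataset.
tweets = [
    (datetime(2011, 3, 11, 14, 47), "user_a", "earthquake tokyo earthquake"),
    (datetime(2011, 3, 11, 14, 59), "user_b", "trains stopped tokyo"),
    (datetime(2011, 3, 11, 15, 5), "user_a", "tsunami warning"),
]
slots = count_vectors(tweets)
first = slots[datetime(2011, 3, 11, 14, 30)]
print(first["user_a"]["earthquake"])  # 2
```

Each slot's author-by-word counts could then be arranged as a matrix and passed to a decomposition-based clustering step, with a feature selection step applied per cluster, as the abstract outlines.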


Bibliographic details

Source: IEEE International Conference on Data Mining Workshops, 2016, pp. 724-731 (8 pages)
Authors: Takako Hashimoto; Dave Shepard; Tetsuji Kuboyama; Kilho Shin
Original format: PDF
Keywords: Feature extraction; Earthquakes; Data mining; Matrix decomposition; Social network services; Time series analysis; Electronic mail


Similar literature


1. Fast hybrid dimensionality reduction method for classification based on feature selection and grouped feature extraction [J] . Li Mengmeng, Wang Haofeng, Yang Lifang, Expert Systems with Applications . 2020, issue Jula

2. Paragraph Selection Methods Using Feature-Based on Segment-Based Clustering Process Using Paragraphs for Identifying Topics on Indication Detection of Plagiarism System [J] . Denar Regata Akbi, Arini Rahmawati Rosyadi, Kinetik . 2018, No. 2

3. A comparison of feature extraction strategies using wavelet dictionaries and feature selection methods for single trial P300-based BCI [J] . Acevedo R., Atum Y., Gareis I., Medical and Biological Engineering and Computing: Journal of the International Federation for Medical and Biological Engineering . 2019, No. 3

4. Topic Extraction Method from Millions of Tweets Based on Fast Feature Selection Technique CWC [C] . Takako Hashimoto, Dave Shepard, Tetsuji Kuboyama, IEEE International Conference on Data Mining Workshops . 2016

5. Pattern recognition and feature extraction using lidar-derived elevation models in GIS: A comparison between visualization techniques and automated methods for identifying prehistoric ditch-fortified sites in North Dakota [D] . Radermacher, Matthew Jeffery. 2016

6. DectICO: an alignment-free supervised metagenomic classification method based on feature extraction and dynamic selection [O] . Xiao Ding, Fudong Cheng, Changchang Cao, 2015

7. A Fast and Intelligent Open-Circuit Fault Diagnosis Method for a Five-Level NNPP Converter Based on an Improved Feature Extraction and Selection Model [O] . Shu Ye, Jianguo Jiang, Zhongzheng Zhou, 2020

8. Improved Feature Extraction, Feature Selection, and Identification Techniques That Create a Fast Unsupervised Hyperspectral Target Detection Algorithm [R] . Johnson, R. J. 2008

