Statistical data mining for Sina Weibo, a Chinese micro-blog: sentiment modelling and randomness reduction for topic modelling

Abstract

Before the arrival of modern information and communication technology, it was not easy to capture people’s thoughts and sentiments; however, the development of statistical data mining techniques and the prevalence of mass social media provide opportunities to capture those trends. Among all types of social media, micro-blogs use a limit of 140 characters to force users to get straight to the point, making the posts brief but content-rich resources for investigation. The data mining object of this thesis is Weibo, the most popular Chinese micro-blog.

In the first part of the thesis, we carry out exploratory data mining on Weibo. After a literature review of micro-blogs, the initial steps of data collection and data pre-processing are introduced. This is followed by analysis of the posting times, analysis of the relationship between post intensity and share price, term frequency analysis and cluster analysis.

Secondly, we conduct time series modelling on the sentiment of Weibo posts. Considering the properties of Weibo sentiment, we mainly adopt the framework of an ARMA mean with GARCH-type conditional variance to fit the patterns. Other distinct models are also considered for negative sentiment because of its complexity. Model selection and validation are introduced to verify the fitted models.

Thirdly, Latent Dirichlet Allocation (LDA) is explained in depth as a way to discover topics from large sets of textual data. The major contribution is a Randomness Reduction Algorithm that post-processes the output of topic models, filtering out the insignificant topics and using topic distributions to find the most persistent topics. At the end of this chapter, evidence of the effectiveness of the Randomness Reduction is presented from empirical studies. The topic classification and evolution are also unveiled.
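For reference, the "ARMA mean with GARCH-type conditional variance" framework named in the abstract is usually written as below. This is the standard textbook ARMA(p, q)-GARCH(r, s) form, not necessarily the exact specification fitted in the thesis. The symbols are notation assumed here for illustration: y_t for the sentiment series, p, q, r, s for the model orders, and the Greek letters for the coefficients.

```latex
\begin{aligned}
y_t &= \mu + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t, \\
\varepsilon_t &= \sigma_t z_t, \qquad z_t \sim \text{i.i.d.}(0, 1), \\
\sigma_t^2 &= \omega + \sum_{k=1}^{r} \alpha_k\, \varepsilon_{t-k}^2 + \sum_{l=1}^{s} \beta_l\, \sigma_{t-l}^2,
\qquad \omega > 0,\ \alpha_k \ge 0,\ \beta_l \ge 0.
\end{aligned}
```

The first line is the conditional mean, and the last line lets the conditional variance depend on past squared shocks and past variances, which is what allows periods of calm and turbulent sentiment to cluster.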

Bibliographic record

  • Author: Cheng Wenqian
  • Year: 2017
  • Original format: PDF
  • Language of text: en
