Predictive and Interpretable Text Machine Learning Models with Applications in Political Science

Abstract

In this era, massive amounts of data are routinely collected and warehoused to be analyzed for scientific and industrial goals. Text data are a major constituent of these data treasure troves. However, with the steep increase in the amount and variety of accessible text data, it has become very difficult for a human to meaningfully analyze textual data without the help of automated text machine learning models. Topic models are one such method. They reduce the cost of analyzing large-scale corpora by identifying, in an unsupervised manner, the underlying thematic structure of the corpus. This thematic structure provides a coarse summary of the documents and allows researchers to quickly explore how topics connect with each other and change over time.

The success of automated topical analysis by topic models has led to another interesting area of text analysis: sentiment analysis. Sentiment analysis is the detection of opinions, feelings, and general sentiments expressed in text. It gained relevance through the rise of social media platforms, which increased the amount of sentiment-containing text data such as Yelp reviews, Tweets, and opinion blogs. Efficient and effective sentiment analysis of such corpora will lead to valuable information about political and social discourse. Hence, social scientists have become increasingly interested in identifying and measuring the relationship between topics and associated sentiments to better understand social and political cultures, attitudes, and processes.

In Part I of this thesis, we propose a statistical model of text which simultaneously detects both topic and sentiment and allows for the inclusion of document metadata. The proposed model improves upon existing topic-sentiment models in two ways: i) the assumption that topics are associated with a range of sentiments and ii) the ability to use document-level covariates for improved estimation and analysis of the relationship between topics and sentiments. By applying the proposed model to two different datasets, i) a collection of political blog posts and ii) Yelp reviews, we demonstrate how detection of both topic and sentiment with the inclusion of document-level covariates can allow for more informative model summaries as compared to current topic and topic-sentiment models.

Topic models are easy to use and interpret; therefore, many variants of topic models have been developed to customize them to various research applications. Evaluation of topic models is thus necessary for appropriate model selection. For this reason, in Part II of this thesis, we develop three new metrics which improve upon the existing evaluation approaches by identifying the benefits of topic-sentiment models over topic models.

Our evaluation metrics are based on three important criteria: sentiment prediction accuracy, feature stability, and computation time. Not only is it important to be able to show that one model achieves higher sentiment prediction accuracy than another, but it is also vital to ensure that the features used to generate a prediction are meaningful and stable, and that the algorithm has reasonable computational speed. We will use these three metrics to compare our proposed topic-sentiment model to topic models using a case study in which we aim to predict the partisanship and tone of political TV ads. Moreover, since these metrics are not specific to topic models, we will also provide a comparison of topic models with word2vec and Concise Comparative Summaries (CCS) which, to the best of our knowledge, has not been done before. We demonstrate that although the proposed topic-sentiment model is able to better predict sentiment than topic models, word2vec had the highest prediction accuracy, CCS identified the most stable features for prediction, and both of these models required less computation time.
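
The abstract names three evaluation criteria (prediction accuracy, feature stability, and computation time) without saying how each is computed. The Python sketch below is only an illustration of how such criteria might be measured for a generic text classifier. It is not the thesis's method: the toy word-frequency classifier, the use of Jaccard overlap across bootstrap resamples as a stand-in for feature stability, the tiny example corpus, and every function name are assumptions made here.

```python
import random
import time
from collections import Counter


def top_features(docs, labels, k=4):
    """Toy feature selector (assumed): keep the k words whose document
    frequency differs most between the positive and negative class."""
    pos = Counter(w for d, y in zip(docs, labels) if y == 1 for w in set(d.split()))
    neg = Counter(w for d, y in zip(docs, labels) if y == 0 for w in set(d.split()))
    n_pos = max(1, sum(labels))
    n_neg = max(1, len(labels) - sum(labels))
    vocab = set(pos) | set(neg)
    score = {w: pos[w] / n_pos - neg[w] / n_neg for w in vocab}
    # Sort by absolute score, breaking ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda w: (-abs(score[w]), w))
    return {w: score[w] for w in ranked[:k]}


def predict(features, doc):
    """Label a document 1 if its selected words lean positive, else 0."""
    total = sum(features.get(w, 0.0) for w in set(doc.split()))
    return 1 if total > 0 else 0


def accuracy(features, docs, labels):
    """Criterion 1: share of documents whose label is predicted correctly."""
    hits = sum(predict(features, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


def jaccard(a, b):
    """Overlap between two feature sets, from 0 (disjoint) to 1 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def feature_stability(docs, labels, k=4, n_resamples=20, seed=0):
    """Criterion 2 (assumed definition): mean pairwise Jaccard overlap of the
    feature sets selected on bootstrap resamples of the corpus."""
    rng = random.Random(seed)
    idx = range(len(docs))
    sets = []
    for _ in range(n_resamples):
        sample = [rng.choice(idx) for _ in idx]
        sets.append(top_features([docs[i] for i in sample],
                                 [labels[i] for i in sample], k))
    pairs = [(i, j) for i in range(len(sets)) for j in range(i + 1, len(sets))]
    return sum(jaccard(sets[i], sets[j]) for i, j in pairs) / len(pairs)


# Invented four-document corpus, for illustration only.
docs = ["great leader strong economy", "strong plan great future",
        "corrupt failed policy", "failed leader corrupt deal"]
labels = [1, 1, 0, 0]

# Criterion 3: wall-clock time needed to fit the model.
start = time.perf_counter()
features = top_features(docs, labels)
elapsed = time.perf_counter() - start

print("accuracy:", accuracy(features, docs, labels))
print("stability:", feature_stability(docs, labels))
print("fit time (s):", elapsed)
```

Under these assumptions, any model that exposes a fit step, a set of selected features, and a prediction function can be scored on the same three axes, which is the property that lets the abstract's comparison extend from topic models to word2vec and CCS. The accuracy here is computed in-sample for brevity; a real comparison would use held-out data.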

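Bibliographic record

  • Author

    Kuang, Christine Yai.

  • Author affiliation

    University of California, Berkeley.

  • Degree-granting institution: University of California, Berkeley.
  • Discipline: Statistics.
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 122 p.
  • Total pages: 122
  • Original format: PDF
  • Language of text: eng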