Journal: The Annals of Applied Statistics

Model trees with topic model preprocessing: An approach for data journalism illustrated with the Wikileaks Afghanistan war logs


Abstract

The WikiLeaks Afghanistan war logs contain nearly 77,000 reports of incidents in the US-led Afghanistan war, covering the period from January 2004 to December 2009. The recent growth of data on complex social systems and the potential to derive stories from them has shifted the focus of journalistic and scientific attention increasingly toward data-driven journalism and computational social science. In this paper we advocate the use of modern statistical methods for problems of data journalism and beyond, which may help journalistic and scientific work and lead to additional insight. Using the WikiLeaks Afghanistan war logs for illustration, we present an approach that builds intelligible statistical models for interpretable segments in the data, in this case to explore the fatality rates associated with different circumstances in the Afghanistan war. Our approach combines preprocessing by Latent Dirichlet Allocation (LDA) with model trees. LDA is used to process the natural language information contained in each report summary by estimating latent topics and assigning each report to one of them. Together with other variables, these topic assignments serve as splitting variables for finding segments in the data to which local statistical models for the reported number of fatalities are fitted. Segmentation and fitting are carried out with recursive partitioning of negative binomial distributions. We identify segments with different fatality rates that correspond to a small number of topics and other variables as well as their interactions. Furthermore, we carve out the similarities between segments and connect them to stories that have been covered in the media. This gives an unprecedented description of the war in Afghanistan and serves as an example of how data journalism, computational social science and other areas with interest in database data can benefit from modern statistical techniques.
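The segmentation idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' procedure: the paper uses recursive partitioning with model trees, whereas the code below performs a single greedy split, fits an intercept-only negative binomial per segment by the method of moments, and picks the splitting variable with the largest log-likelihood gain. The field names `topic`, `region`, and `fatalities`, the helper names, and the toy records are all hypothetical and made up for illustration; the topic labels stand in for assignments that an LDA preprocessing step would produce.

```python
import math
from collections import defaultdict

def fit_nb_moments(counts):
    """Method-of-moments estimates (mu, size) of a negative binomial."""
    n = len(counts)
    mu = sum(counts) / n
    var = sum((y - mu) ** 2 for y in counts) / (n - 1) if n > 1 else 0.0
    # Without overdispersion (var <= mu) use a large size, i.e. a near-Poisson fit.
    size = mu * mu / (var - mu) if var > mu else 1e8
    return mu, size

def nb_loglik(counts, mu, size):
    """Negative binomial log-likelihood with mean mu and dispersion size."""
    if mu == 0:
        # Assumes mu was estimated from these counts, so every count is zero.
        return 0.0
    log_p = -math.log1p(mu / size)               # log(size / (size + mu))
    log_q = math.log(mu) - math.log(size + mu)   # log(mu / (size + mu))
    total = 0.0
    for y in counts:
        # log(Gamma(y + size) / Gamma(size)) written as a stable sum of logs
        total += sum(math.log(size + k) for k in range(y))
        total += -math.lgamma(y + 1) + size * log_p + y * log_q
    return total

def split_gain(records, key):
    """Log-likelihood gain from fitting one model per level of `key`."""
    counts = [r["fatalities"] for r in records]
    pooled = nb_loglik(counts, *fit_nb_moments(counts))
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["fatalities"])
    segmented = sum(nb_loglik(g, *fit_nb_moments(g)) for g in groups.values())
    return segmented - pooled

def best_split(records, keys):
    """Return the candidate splitting variable with the largest gain."""
    return max(keys, key=lambda k: split_gain(records, k))

# Made-up toy records: (topic label, region, reported fatalities).
_toy = [
    ("ied", "north", 3), ("ied", "south", 5), ("ied", "north", 0),
    ("ied", "south", 8), ("ied", "south", 4), ("ied", "north", 6),
    ("patrol", "north", 0), ("patrol", "south", 0), ("patrol", "north", 1),
    ("patrol", "south", 0), ("patrol", "north", 0), ("patrol", "south", 2),
]
records = [{"topic": t, "region": g, "fatalities": y} for t, g, y in _toy]

if __name__ == "__main__":
    for key in ("topic", "region"):
        print(key, round(split_gain(records, key), 3))
    print("chosen split:", best_split(records, ["topic", "region"]))
```

A full model tree would apply such a split recursively within each resulting segment and would use a formal stopping rule instead of raw likelihood gain, which always favors variables with many levels.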
