Conference: International Conference on Applied Human Factors and Ergonomics

Leveraging topic models to develop metrics for evaluating the quality of narrative threads extracted from news stories

Abstract

Analysts and software systems are increasingly tasked with making sense of a growing amount of data to help their organizations make decisions involving risk and uncertainty. A key enabler of this work is the ability to quickly discover structure in large amounts of text such as news stories and blogs. Recent work in this area has shown that it is possible to automatically link documents from a corpus to build a narrative structure, called a story chain, without prior domain knowledge [1]. This approach is unsupervised and discovers large numbers of story chains of variable quality. In this paper, we describe and evaluate methods to identify the most coherent and informative story chains. We explore two types of topic-model-based analytics. The first is a measure of representativeness that captures how well a story chain represents the corpus from which it was generated. It compares the topics found over time in a story chain against those expressed in the corpus during the same time period. Our hypothesis is that story chains whose topic expression is similar to that of the corpus will convey narratives that are central to the corpus. This type of analytic could help an analyst quickly focus on the key narratives in a large corpus of documents. The second is a measure of the quality of a story chain and is composed of topic consistency and topic persistence measures. Our hypothesis is that high-quality chains are composed of sequences of stories that have clearly defined primary topics that persist across significant portions of the story chain. We used these analytics to predict the clarity of story chains, assigning each to one of four categories: (1) very clear narrative, (2) somewhat clear narrative, (3) somewhat unclear narrative, or (4) very unclear narrative. We were able to train a model that labeled story chains with the same label as human coders 77% of the time.
Our dataset was composed of 7,074 English-language news stories released during the Brazil protests of 2013, from which 5,606 story chains were generated. We randomly selected 60 story chains for hand scoring to serve as our gold-standard data set for experimentation.
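
The abstract does not give formulas for the two analytics, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes each story (and each time slice of the corpus) has already been reduced to a topic probability vector by some topic model. The function names, the use of Jensen-Shannon divergence as the similarity measure, and the specific definitions of consistency (mean weight of each story's primary topic) and persistence (share of stories whose primary topic matches the chain's most common one) are all illustrative assumptions.

```python
import math
from collections import Counter


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def representativeness(chain_by_period, corpus_by_period):
    """ASSUMED definition: mean topic similarity (1 - JSD) between the chain
    and the corpus, computed over aligned time periods."""
    sims = [1 - jensen_shannon(c, k)
            for c, k in zip(chain_by_period, corpus_by_period)]
    return sum(sims) / len(sims)


def topic_consistency(story_topics):
    """ASSUMED definition: average weight of each story's primary topic,
    i.e. how clearly defined the primary topics are."""
    return sum(max(t) for t in story_topics) / len(story_topics)


def topic_persistence(story_topics):
    """ASSUMED definition: fraction of stories whose primary topic is the
    chain's most common primary topic."""
    primary = [max(range(len(t)), key=t.__getitem__) for t in story_topics]
    most_common_count = Counter(primary).most_common(1)[0][1]
    return most_common_count / len(primary)


# Hypothetical four-story chain over three topics.
chain = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
]
print(topic_consistency(chain))
print(topic_persistence(chain))
print(representativeness(chain, chain))
```

Scores like these could then serve as input features for a classifier over the four clarity labels; the abstract does not say which model or features were actually used.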