Discovering Diverse and Salient Threads in Document Collections

Abstract

We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads: singly linked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.
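To make the notion of a "diverse set of threads" concrete, the toy sketch below treats documents as nodes in a directed graph, scores each chain by document salience plus link coherence, and greedily picks chains that overlap little with those already chosen. This is only an illustration under stated assumptions: the abstract does not specify the model, so the greedy selection here is a simple stand-in and not the authors' probabilistic technique. The names `EDGES`, `SALIENCE`, `enumerate_threads`, `thread_score`, `pick_diverse_threads`, and `overlap_penalty`, along with all numeric values, are hypothetical.

```python
# Toy illustration only: a greedy stand-in, NOT the paper's probabilistic model.
# All names and numbers below are hypothetical.

# Directed links between documents, weighted by how coherent the link is.
EDGES = {
    ("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "d"): 0.2,
    ("d", "e"): 0.7, ("e", "f"): 0.9, ("b", "e"): 0.1,
}
# Per-document importance (salience).
SALIENCE = {"a": 1.0, "b": 0.9, "c": 0.8, "d": 0.5, "e": 0.7, "f": 0.9}


def enumerate_threads(edges, length):
    """Return every simple chain of `length` documents that follows the links."""
    nodes = sorted({u for u, _ in edges} | {v for _, v in edges})
    threads = [(n,) for n in nodes]
    for _ in range(length - 1):
        threads = [
            t + (v,)
            for t in threads
            for (u, v) in sorted(edges)
            if u == t[-1] and v not in t
        ]
    return threads


def thread_score(thread, salience, edges):
    """Sum of document salience plus coherence of consecutive links."""
    node_part = sum(salience[d] for d in thread)
    link_part = sum(edges[(a, b)] for a, b in zip(thread, thread[1:]))
    return node_part + link_part


def pick_diverse_threads(edges, salience, length, k, overlap_penalty=1.0):
    """Greedily pick up to k threads, penalizing documents already covered."""
    candidates = enumerate_threads(edges, length)
    chosen, covered = [], set()
    for _ in range(k):
        best = max(
            (t for t in candidates if t not in chosen),
            key=lambda t: thread_score(t, salience, edges)
            - overlap_penalty * len(covered & set(t)),
            default=None,
        )
        if best is None:  # fewer candidates than requested threads
            break
        chosen.append(best)
        covered |= set(best)
    return chosen


threads = pick_diverse_threads(EDGES, SALIENCE, length=3, k=2)
print(threads)  # [('a', 'b', 'c'), ('d', 'e', 'f')]
```

Without the overlap penalty, the second pick would still be the chain d, e, f in this toy graph, but with a much smaller margin over chains that reuse documents from the first thread; the penalty is what pushes the selection toward chains covering different documents. Exhaustive enumeration is only feasible for tiny graphs and says nothing about how the scalability reported above is achieved.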