Discovering Diverse and Salient Threads in Document Collections


Abstract

We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads: singly-linked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.


Bibliographic record

Source
Conference on Empirical Methods in Natural Language Processing; Conference on Computational Natural Language Learning | 2012 | pp. 710-720 | 11 pages
Authors
Jennifer Gillenwater; Alex Kulesza; Ben Taskar
Original format: PDF

Similar literature

1. A rough set model with ontologies for discovering maximal association rules in document collections [J]. Yaxin Bi, Terry Anderson, Sally McClean. Knowledge-Based Systems, 2003, Issue 5-6.
2. Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections [J]. Aleksandrs Slivkins, Filip Radlinski, Sreenivas Gollapudi. Journal of Machine Learning Research, 2013, February issue.
3. Diverse Population, Diverse Collection? Youth Collections in the United States [J]. Virginia Kay Williams, Nancy Deyoe. Technical Services Quarterly, 2014, Issue 2.
4. Discovering Diverse and Salient Threads in Document Collections [C]. Jennifer Gillenwater, Alex Kulesza, Ben Taskar. Conference on Empirical Methods in Natural Language Processing, 2012.
5. Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications [D]. Danilevsky, Marina Grigoryevna. 2014.
6. Document retrieval on repetitive string collections [O]. Travis Gagie, Aleksi Hartikainen, Kalle Karhu.
7. Segmentation-Based Retrieval of Document Images from Diverse Collections [O]. Michael A. Moll, Henry S. Baird. 2008.
