Refine the Corpora Based on Document Manifold


Abstract

Tracking and using the overwhelming volume of news generated on the internet is challenging. One approach is to use topic models, such as pLSI, LDA, LPI, LapPLSI, and LTM, to discover news topics automatically. However, in many real applications the topics inferred by these models are of limited use, because a proportion of the documents belong to no topic at all. In this paper, we propose a new technique for refining the document corpora before topic modeling. Inspired by manifold theory, we use Laplacian eigenmaps to discover the submanifold structure of the document space, identify the documents that are only loosely related to other documents, and exclude them from the corpora. Experiments show that combining topic models with our algorithm significantly improves the quality of the topics.
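
The refinement idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the graph construction, the weighting, or the criterion for "loose relations", so everything below is an assumption made for illustration. The sketch builds a dense graph with heat-kernel weights on cosine distance, solves the Laplacian eigenmaps problem L y = λ D y through the symmetric normalized Laplacian, and flags the documents whose mean distance to their nearest neighbours in the embedding is largest. The function names and the parameters `dim`, `t`, `k`, and `n_drop` are hypothetical.

```python
import numpy as np


def laplacian_eigenmap(X, dim=3, t=0.1):
    """Embed the rows of a document-term matrix X with Laplacian eigenmaps.

    Assumptions of this sketch (not from the paper): a dense graph and
    heat-kernel weights exp(-(1 - cosine) / t) with bandwidth t.
    """
    X = np.asarray(X, dtype=float)
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    unit = X / norms
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    W = np.exp(-(1.0 - cos) / t)           # similarity weights in (0, 1]
    np.fill_diagonal(W, 0.0)               # no self-loops
    d = np.maximum(W.sum(axis=1), 1e-12)   # weighted degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    # Map back to solutions of L y = lambda D y and skip the trivial
    # constant eigenvector that belongs to eigenvalue 0.
    return d_inv_sqrt[:, None] * vecs[:, 1:dim + 1]


def loosely_connected(Y, k=3, n_drop=2):
    """Indices of the n_drop points with the largest mean distance to
    their k nearest neighbours in the embedding Y (assumed criterion)."""
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)         # ignore self-distance
    knn_mean = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    return np.argsort(knn_mean)[::-1][:n_drop]


def refine_corpus(X, dim=3, k=3, n_drop=2):
    """Return (keep, drop) index arrays for the documents in X."""
    Y = laplacian_eigenmap(X, dim=dim)
    drop = loosely_connected(Y, k=k, n_drop=n_drop)
    keep = np.setdiff1d(np.arange(len(X)), drop)
    return keep, drop
```

Calling `refine_corpus(X)` on a document-term count matrix returns the indices to keep and to drop; a topic model would then be fitted on `X[keep]`. A k-nearest-neighbour graph is the more common choice for Laplacian eigenmaps on large corpora; the dense graph is used here only to keep the sketch short and the graph connected.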


Bibliographic details

Source: International Conference on Advanced Data Mining and Applications | 2013 | 10 pages
Authors: Chengwei Yao; Yilin Wang; Gencai Chen
Original format: PDF
Chinese Library Classification: TP311.13
Keywords: topic model; manifold; graph Laplacian; document clustering

Similar literature

1. The Congruity Between Linkage-Based Factors and Content-Based Clusters: An Experimental Study Using Multiple Document Corpora [J]. Tsung Teng Chen. Journal of the American Society for Information Science and Technology, 2016, Issue 3.
2. Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora [J]. Ivan Vulić, Wim De Smet, Marie-Francine Moens. Information Retrieval, 2013, Issue 3.
3. Refine the Corpora Based on Document Manifold [C]. Chengwei Yao, Yilin Wang, Gencai Chen. International Conference on Advanced Data Mining and Applications, 2013.
4. Edits Based Categorization of Crowd Sourced Document Corpora with Application to Wikipedia [D]. Fang, Yue. 2018.
5. FacetGist: Collective Extraction of Document Facets in Large Technical Corpora [O]. Tarique Siddiqui, Xiang Ren, Aditya Parameswaran.
6. Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords [O]. Arzucan Özgür, Tunga Güngör. 2008.
