Refine the Corpora Based on Document Manifold

Abstract

It is challenging to track and make use of the overwhelming volume of news published on the Internet. One approach is to use topic models, such as pLSI, LDA, LPI, LapPLSI, and LTM, to discover news topics automatically. In many real applications, however, the topics inferred by these models are not very useful, because some proportion of the documents always belongs to no topic at all. In this paper, we propose a new technique for refining a document corpus before topic modeling. Inspired by manifold theory, we use Laplacian eigenmaps to discover the submanifold structure of the document space, identify the documents that are only loosely related to the other documents, and exclude them from the corpus. Experiments show that topic models combined with our algorithm produce topics of significantly higher quality.
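
The refinement idea can be illustrated with a minimal sketch. The abstract gives no details on how the graph is built or how "loose" documents are detected, so everything below is an assumption made for illustration: cosine similarity as the edge weight, a fully connected graph, non-negative term weights, and a hypothetical looseness score (mean distance to the k nearest documents in the embedded space). The function names `laplacian_eigenmap`, `looseness_scores`, and `refine_corpus` and the parameters `dim`, `k`, and `n_drop` are not taken from the paper.

```python
import numpy as np


def laplacian_eigenmap(X, dim=2):
    """Embed documents (rows of X, e.g. non-negative TF-IDF vectors)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.maximum(norms, 1e-12)
    W = unit @ unit.T                 # assumption: cosine similarity as edge weight
    np.fill_diagonal(W, 0.0)          # no self-loops
    deg = np.maximum(W.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)   # eigenvalues come back in ascending order
    # Map back to solutions of L y = lambda D y and skip the trivial first one.
    return d_inv_sqrt[:, None] * vecs[:, 1:dim + 1]


def looseness_scores(X, dim=2, k=3):
    """Hypothetical score: mean embedded distance to the k nearest documents."""
    Y = laplacian_eigenmap(X, dim)
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)    # a document is not its own neighbour
    return np.sort(dist, axis=1)[:, :k].mean(axis=1)


def refine_corpus(X, n_drop, dim=2, k=3):
    """Return the sorted indices of documents kept after dropping the loosest."""
    scores = looseness_scores(X, dim, k)
    loosest = np.argsort(scores)[::-1][:n_drop]
    return sorted(set(range(len(scores))) - set(loosest.tolist()))
```

The indices returned by `refine_corpus` would then select the rows passed on to whichever topic model is used. In practice a sparse k-nearest-neighbour graph and a sparse eigensolver would probably replace the dense matrices here, which only suit small corpora.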
