Journal: Knowledge and Data Engineering, IEEE Transactions on

Pattern-based Topics for Document Modelling in Information Filtering


Abstract

Many mature term-based and pattern-based approaches have been used in information filtering to generate users’ information needs from a collection of documents. A fundamental assumption of these approaches is that the documents in the collection are all about one topic. In reality, however, users’ interests can be diverse, and the documents in the collection often involve multiple topics. Topic modelling, such as Latent Dirichlet Allocation (LDA), was proposed to generate statistical models that represent multiple topics in a collection of documents, and it has been widely used in machine learning and information retrieval, but its effectiveness in information filtering has not been well explored. Patterns are generally thought to be more discriminative than single terms for describing documents. However, the enormous number of discovered patterns hinders their effective and efficient use in real applications, so selecting the most discriminative and representative patterns from this huge set becomes crucial. To address these limitations, this paper proposes a novel information filtering model, the Maximum matched Pattern-based Topic Model (MPBTM). The main distinctive features of the proposed model are: (1) user information needs are generated in terms of multiple topics; (2) each topic is represented by patterns; (3) patterns are generated from topic models and are organized in terms of their statistical and taxonomic features; and (4) the most discriminative and representative patterns, called Maximum Matched Patterns, are used to estimate a document’s relevance to the user’s information needs in order to filter out irrelevant documents. Extensive experiments on the TREC data collection Reuters Corpus Volume 1 evaluate the effectiveness of the proposed model. The results show that the proposed model significantly outperforms both state-of-the-art term-based models and pattern-based models.
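
The abstract does not give the scoring formula, so the following is only a minimal sketch of how features (1) to (4) might fit together, not the paper's algorithm. It assumes each topic is a mapping from patterns (term sets) to a support value, that a "maximum matched pattern" is a pattern contained in the document and not strictly contained in any other matched pattern of the same topic, and that relevance is a topic-weighted sum of those supports. All names (`maximum_matched_patterns`, `relevance`, `topic_weights`) and all numbers are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

# Assumption: a topic maps each pattern (a set of terms) to a support value.
Topic = Dict[FrozenSet[str], float]


def maximum_matched_patterns(topic: Topic, doc_terms: Set[str]) -> List[FrozenSet[str]]:
    """Return patterns found in the document that are not strictly
    contained in another matched pattern of the same topic."""
    matched = [p for p in topic if p <= doc_terms]
    return [p for p in matched if not any(p < q for q in matched)]


def relevance(doc_terms: Set[str], topics: List[Topic], topic_weights: List[float]) -> float:
    """Illustrative score: for each topic, add topic weight times the
    support of every maximum matched pattern (an assumed formula)."""
    score = 0.0
    for topic, weight in zip(topics, topic_weights):
        for pattern in maximum_matched_patterns(topic, doc_terms):
            score += weight * topic[pattern]
    return score


# Made-up toy data: two topics describing one user's interests.
topics = [
    {frozenset({"oil"}): 0.5, frozenset({"oil", "price"}): 0.3, frozenset({"opec"}): 0.2},
    {frozenset({"bank"}): 0.6, frozenset({"bank", "rate"}): 0.4},
]
topic_weights = [0.7, 0.3]

doc = {"oil", "price", "rises", "bank"}
print(relevance(doc, topics, topic_weights))
```

In this toy run the first topic contributes only through `{"oil", "price"}`, because the shorter pattern `{"oil"}` is subsumed by it, and the second topic contributes through `{"bank"}`, giving a score of about 0.39. A document sharing no pattern with any topic scores 0 and would be filtered out under any positive threshold.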
