A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA

Abstract

Nowadays there is an important need by journalists and media monitoring companies to cluster news in large amounts of web articles, in order to ensure fast access to their topics or events of interest. Our aim in this work is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. The estimation of the correct number of topics is a challenging issue, due to the existence of “noise”, i.e. news articles which are irrelevant to all other topics. In this context, we introduce a novel density-based news clustering framework, in which the assignment of news articles to topics is done by the well-established Latent Dirichlet Allocation, but the estimation of the number of clusters is performed by our novel method, called “DBSCAN-Martingale”, which allows for extracting noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering, without knowing the number of clusters k, the framework of DBSCAN-Martingale provides the correct number of clusters and the highest Normalized Mutual Information.
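
The abstract gives no pseudocode, so the following is only a minimal sketch of one possible reading of the cluster-counting step. It assumes that clusters are collected progressively by rerunning plain DBSCAN at increasing, randomly drawn density levels, and that the most frequent cluster count over several random runs is taken as the estimate. It reruns DBSCAN directly rather than reading clusters off an OPTICS reachability plot, and it omits the LDA assignment step. All function names, parameters (`eps_max`, `min_pts`, `steps`, `realizations`), and the toy data are hypothetical and not taken from the paper.

```python
import random
from collections import Counter

def dbscan(points, eps, min_pts):
    """Plain DBSCAN on small data; returns one label per point, -1 = noise."""
    n = len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(n)
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps * eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1              # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nb)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:     # core point: keep expanding
                queue.extend(nbj)
    return labels

def progressive_clusters(points, eps_values, min_pts):
    """Assumed reading: keep a cluster only if none of its points was
    already extracted at a smaller eps."""
    final = [-1] * len(points)
    next_id = 0
    for eps in sorted(eps_values):
        labels = dbscan(points, eps, min_pts)
        for c in sorted(set(labels) - {-1}):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if all(final[i] == -1 for i in members):
                for i in members:
                    final[i] = next_id
                next_id += 1
    return final

def estimate_k(points, eps_max, min_pts, steps=5, realizations=20, seed=0):
    """Majority vote on the cluster count over random eps sequences."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(realizations):
        eps_values = [rng.uniform(0, eps_max) for _ in range(steps)]
        final = progressive_clusters(points, eps_values, min_pts)
        counts[len(set(final) - {-1})] += 1
    return counts.most_common(1)[0][0]

# Toy data: two tight groups, one looser group, and one isolated noise point.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
          (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1),
          (10, 0), (10.3, 0), (10, 0.3), (10.3, 0.3),
          (20, 20)]
final = progressive_clusters(points, [0.2, 0.5], min_pts=3)
k = estimate_k(points, eps_max=1.0, min_pts=3)
```

On this toy data the two tight groups are extracted at the smaller radius and the looser group only at the larger one, while the isolated point stays labelled as noise. In a full pipeline along the lines the abstract describes, the estimated `k` would then be passed as the number of topics to an LDA model, which is not shown here.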