A clustering technique for news articles using WordNet

Abstract

The Web is overcrowded with news articles, an information source that is overwhelming in both volume and diversity. Document clustering is a powerful technique that has been widely used to organize data into smaller, manageable information kernels. Several approaches have been proposed, but they suffer from problems such as synonymy, ambiguity, and the lack of descriptive labels for the generated clusters. In this work, we investigate the application of a wide range of clustering algorithms and similarity measures to news articles that originate from the Web. We also propose enhancing the standard k-means algorithm with external knowledge from WordNet hypernyms in two ways: enriching the "bag of words" before the clustering process and assisting the label generation procedure after it. Furthermore, we examine the effect that text preprocessing has on clustering. On a corpus of news articles drawn from major news portals, our comparison of existing clustering methodologies revealed that k-means gives better aggregate results in terms of efficiency. Despite the algorithm's simple nature, this advantage is amplified when it is preceded by data cleaning and normalization steps. Moreover, the proposed WordNet-enabled W-k means clustering algorithm significantly improves on standard k-means, and the presented cluster labeling process generates useful, high-quality cluster tags.
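The twofold use of hypernyms described in the abstract can be illustrated with a minimal sketch. This is not the authors' W-k means implementation: the `HYPERNYMS` table is a hypothetical stand-in for WordNet lookups, the seeding rule (first k documents) and the labeling rule (most frequent hypernym in the cluster) are simplifying assumptions, and all function names are invented for illustration.

```python
from collections import Counter
import math

# Hypothetical stand-in for WordNet hypernym lookups (assumption: a real
# system would query WordNet; these entries are illustrative only).
HYPERNYMS = {
    "striker": ["athlete"],
    "goalkeeper": ["athlete"],
    "match": ["contest"],
    "senator": ["politician"],
    "minister": ["politician"],
    "election": ["vote"],
}

def enrich(tokens):
    """Return a bag of words extended with the hypernyms of each token."""
    bag = Counter(tokens)
    for tok in tokens:
        for hyper in HYPERNYMS.get(tok, []):
            bag[hyper] += 1
    return bag

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(bags):
    """Average the term counts of the member bags."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return Counter({t: v / len(bags) for t, v in total.items()})

def kmeans(bags, k, iterations=10):
    """Plain k-means with cosine similarity; the first k bags seed the centroids."""
    centroids = [Counter(b) for b in bags[:k]]
    assignment = [0] * len(bags)
    for _ in range(iterations):
        assignment = [
            max(range(k), key=lambda c: cosine(bag, centroids[c])) for bag in bags
        ]
        for c in range(k):
            members = [bags[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment

def label(bags, assignment, cluster):
    """Label a cluster with its most frequent hypernym (a simplification)."""
    hypernym_terms = {h for hs in HYPERNYMS.values() for h in hs}
    counts = Counter()
    for bag, a in zip(bags, assignment):
        if a == cluster:
            counts.update({t: v for t, v in bag.items() if t in hypernym_terms})
    return counts.most_common(1)[0][0] if counts else None

docs = [
    ["striker", "scores", "match"],
    ["senator", "wins", "election"],
    ["goalkeeper", "saves", "penalty"],
    ["minister", "resigns", "office"],
]
bags = [enrich(d) for d in docs]
assignment = kmeans(bags, k=2)
print(assignment)                   # [0, 1, 0, 1]
print(label(bags, assignment, 0))   # athlete
print(label(bags, assignment, 1))   # politician
```

In this toy corpus the first and third documents share no surface words, so their raw bags have zero cosine similarity; only the shared hypernym "athlete" added by `enrich` lets k-means group them, which is the synonymy problem the abstract refers to.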

Bibliographic details

  • Source
    Knowledge-Based Systems | 2012, issue 2012 | pp. 115-128 | 14 pages
  • Author affiliations

    Computer Technology Institute and Press "Diophantus", Patras, Greece; Computer Engineering and Informatics Department, University of Patras, 26500, Rion, Patras, Greece

    Computer Technology Institute and Press "Diophantus", Patras, Greece; Computer Engineering and Informatics Department, University of Patras, 26500, Rion, Patras, Greece

  • Original format: PDF
  • Language: English
  • Keywords

    news clustering; k-means; W-k means; cluster labeling; partitional clustering;
