A Semi-supervised Document Clustering Algorithm based on EM

Abstract

Document clustering is a very hard task in automatic text processing, since it requires extracting regular patterns from a document collection without a priori knowledge of the category structure. The task can also be difficult for humans, because many different but valid partitions may exist for the same collection. Moreover, the lack of information about categories makes it difficult to apply effective feature selection techniques to reduce the noise in the representation of texts. Despite these intrinsic difficulties, text clustering is an important task for Web search applications, in which very large collections or long query result lists must be organized automatically. Semi-supervised clustering lies between automatic categorization and auto-organization. The supervisor is not required to specify a set of classes, but only to provide a set of texts grouped according to the criteria to be used to organize the collection. In this paper, we present a novel algorithm for clustering text documents that combines the EM algorithm with a feature selection technique based on information gain. The experimental results show that only very few documents are needed to initialize the clusters and that the algorithm is able to properly extract the regularities hidden in a huge unlabeled collection.
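The abstract names the ingredients (a few grouped seed documents, EM, and information-gain feature selection) but not how they are combined. The following is a minimal sketch of one plausible combination, not the authors' implementation. It assumes a multinomial naive Bayes mixture as the generative model, clamps the seed documents to their given groups, and recomputes information gain from the current hard assignments at every iteration. All function names and parameters (`semi_supervised_em`, `n_features`, `n_iter`, and so on) are hypothetical.

```python
import math
from collections import Counter

def information_gain(docs, labels, n_clusters):
    """Information gain of each term's presence w.r.t. the given hard labels."""
    n = len(docs)
    def entropy(counts, total):
        return -sum(c / total * math.log2(c / total) for c in counts if c) if total else 0.0
    class_counts = Counter(labels)
    h_c = entropy(class_counts.values(), n)
    present = {}  # term -> per-cluster number of documents containing it
    for doc, y in zip(docs, labels):
        for t in set(doc):
            present.setdefault(t, [0] * n_clusters)[y] += 1
    gains = {}
    for t, with_t in present.items():
        n_t = sum(with_t)
        without_t = [class_counts[k] - with_t[k] for k in range(n_clusters)]
        h_cond = (n_t / n) * entropy(with_t, n_t) + ((n - n_t) / n) * entropy(without_t, n - n_t)
        gains[t] = h_c - h_cond
    return gains

def m_step(docs, resp, vocab, n_clusters):
    """Estimate priors and Laplace-smoothed term probabilities from soft assignments."""
    total_resp = sum(sum(r) for r in resp)
    prior = [(sum(r[k] for r in resp) + 1.0) / (total_resp + n_clusters)
             for k in range(n_clusters)]
    counts = [{t: 1.0 for t in vocab} for _ in range(n_clusters)]
    for doc, r in zip(docs, resp):
        for t in doc:
            if t in vocab:
                for k in range(n_clusters):
                    counts[k][t] += r[k]
    word_prob = []
    for k in range(n_clusters):
        total = sum(counts[k].values())
        word_prob.append({t: c / total for t, c in counts[k].items()})
    return prior, word_prob

def e_step(docs, prior, word_prob, vocab, seeds, n_clusters):
    """Compute cluster posteriors; seed documents stay clamped to their group."""
    resp = []
    for i, doc in enumerate(docs):
        if i in seeds:
            resp.append([1.0 if k == seeds[i] else 0.0 for k in range(n_clusters)])
            continue
        logp = [math.log(prior[k]) + sum(math.log(word_prob[k][t]) for t in doc if t in vocab)
                for k in range(n_clusters)]
        top = max(logp)
        weights = [math.exp(v - top) for v in logp]  # subtract max for numerical stability
        norm = sum(weights)
        resp.append([w / norm for w in weights])
    return resp

def semi_supervised_em(docs, seeds, n_clusters, n_features=50, n_iter=10):
    """docs: list of token lists; seeds: {doc index: cluster id} (assumed input format)."""
    # Unlabeled documents start with zero weight, so the first model uses seeds only.
    resp = [[1.0 if (i in seeds and seeds[i] == k) else 0.0 for k in range(n_clusters)]
            for i in range(len(docs))]
    seed_ids = sorted(seeds)
    gains = information_gain([docs[i] for i in seed_ids],
                             [seeds[i] for i in seed_ids], n_clusters)
    labels = []
    for _ in range(n_iter):
        # Keep the n_features terms with the highest gain (ties broken alphabetically).
        vocab = set(sorted(gains, key=lambda t: (-gains[t], t))[:n_features])
        prior, word_prob = m_step(docs, resp, vocab, n_clusters)
        resp = e_step(docs, prior, word_prob, vocab, seeds, n_clusters)
        labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
        gains = information_gain(docs, labels, n_clusters)
    return labels
```

In this sketch, one seed document per group is enough to start: the first M-step builds the term distributions from the seeds alone, and later iterations let the unlabeled documents reshape both the model and the selected vocabulary.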

Bibliographic details

  • Authors

    M. MAGGINI; L. RIGUTINI

  • Year: 2005
  • Original format: PDF
  • Language: English
