Journal: IEEE Transactions on Fuzzy Systems

Large Scale Document Categorization With Fuzzy Clustering

Abstract

Clustering documents into coherent categories is a useful and important step in document processing and understanding. Introducing fuzzy set theory into clustering provides a favorable mechanism for capturing overlap among document clusters. A document dataset is commonly represented as a collection of high-dimensional vectors, which may not fit entirely into memory when the dataset is large and has very high dimensionality. However, most existing fuzzy clustering approaches deal with small, static datasets. Some of them scale well, but they are effective only for low-dimensional data. This paper presents new work on fuzzy clustering of large-scale, high-dimensional data that is especially suitable for document categorization. To account for both large scale and high dimensionality in the problem formulation, our key idea is to incorporate document-tailored fuzzy clustering into a scheme that is effective for large-scale problems. We first identify three representative schemes in fuzzy clustering for handling large-scale data, namely sampling extension, single pass, and divide ensemble. The limitations of fuzzy C-means (FCM)-based approaches for large document clustering are then investigated. Based on this study, we propose new approaches by combining each of hyperspherical FCM and fuzzy coclustering with each of the three scale-up schemes. This enables the new approaches to remain effective for high-dimensional data while extending their scalability. Extensive experiments on real-world large document datasets have been conducted, and the results demonstrate that the proposed approaches consistently outperform existing ones in document categorization.
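The abstract does not give update equations, so the following is only a minimal sketch of how a hyperspherical FCM might be combined with a single-pass scheme. It is not the paper's algorithm. It assumes the commonly cited hyperspherical FCM updates (cosine dissimilarity on unit-length vectors, centroids re-projected onto the unit sphere) and the usual single-pass idea of carrying weighted centroids from one chunk to the next. All function names, the fuzzifier value `m=1.1`, and the use of dense NumPy arrays are assumptions made for illustration; real document vectors would normally be sparse.

```python
import numpy as np


def normalize_rows(a, eps=1e-12):
    """Scale each row to unit L2 norm (near-zero rows are left almost unchanged)."""
    norms = np.linalg.norm(a, axis=1, keepdims=True)
    return a / np.maximum(norms, eps)


def memberships(x, v, m, eps=1e-9):
    """FCM membership matrix of shape (n_clusters, n_points) under cosine dissimilarity."""
    d = np.maximum(1.0 - v @ x.T, eps)        # clipped to avoid division by zero
    inv = d ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)  # each column sums to 1


def weighted_hfcm(x, weights, centroids, m=1.1, n_iter=50):
    """Weighted hyperspherical FCM (assumed update rules, see lead-in)."""
    x = normalize_rows(x)
    v = normalize_rows(centroids)
    for _ in range(n_iter):
        u = memberships(x, v, m)
        # Weighted centroid update, projected back onto the unit sphere.
        v = normalize_rows(((u ** m) * weights) @ x)
    return v, memberships(x, v, m)


def single_pass_hfcm(chunks, n_clusters, m=1.1, seed=0):
    """Cluster chunk by chunk, carrying weighted centroids forward (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    v, w_v = None, None
    for chunk in chunks:
        chunk = normalize_rows(np.asarray(chunk, dtype=float))
        if v is None:
            data, w = chunk, np.ones(len(chunk))
            init = chunk[rng.choice(len(chunk), n_clusters, replace=False)]
        else:
            # Previous centroids stand in for all documents seen so far.
            data = np.vstack([v, chunk])
            w = np.concatenate([w_v, np.ones(len(chunk))])
            init = v
        v, u = weighted_hfcm(data, w, init, m=m)
        w_v = u @ w  # membership mass absorbed by each centroid
    return v, w_v
```

Only one chunk plus the current centroids is held in memory at a time, which is the point of a single-pass scheme. The sampling-extension and divide-ensemble schemes named in the abstract would need different drivers around the same `weighted_hfcm` routine.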
