Journal: Journal of information & knowledge management
High-Dimensional Text Datasets Clustering Algorithm Based on Cuckoo Search and Latent Semantic Indexing

Abstract

Clustering is an important data analysis technique. However, clustering high-dimensional data such as documents requires more effort to extract the rich, relevant information hidden in the multidimensional space. Recently, document clustering algorithms based on metaheuristics have demonstrated their efficiency in exploring the search space and reaching the global best solution rather than a local one. However, most of these algorithms are not practical and suffer from several limitations: they require the number of clusters to be known in advance, they are neither incremental nor extensible, and the documents are indexed by a high-dimensional, sparse matrix. To overcome these limitations, we propose in this paper a new dynamic and incremental approach (CS_LSI) for document clustering based on the recent cuckoo search (CS) optimization and latent semantic indexing (LSI). Experiments conducted on four well-known high-dimensional text datasets show the efficiency of the LSI model in reducing the dimensionality of the space with more precision and less computational time. In addition, the proposed CS_LSI determines the number of clusters automatically by employing a newly proposed index focused on a significant distance measure. The latter is also used in the incremental mode and to detect outlier documents, thereby maintaining more coherent clusters. Furthermore, a comparison with conventional document clustering algorithms shows the superiority of CS_LSI in achieving high-quality clustering.
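The abstract names LSI as the dimensionality-reduction step but gives no implementation details. As a rough illustration only, the sketch below shows the standard formulation of LSI as a truncated singular value decomposition of a term-document matrix. The function names `lsi_reduce` and `cosine_distance`, the toy matrix `A`, and the choice `k=2` are assumptions made for this example and are not taken from the paper; the cuckoo search stage and the proposed cluster-number index are not shown.

```python
import numpy as np


def lsi_reduce(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD.

    term_doc: (n_terms, n_docs) weight matrix, e.g. TF-IDF (hypothetical input).
    Returns an (n_docs, k) array of document coordinates.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(k, len(s))
    # Keep the k largest singular directions; each document becomes a row
    # of V_k scaled by the corresponding singular values.
    return Vt[:k, :].T * s[:k]


def cosine_distance(a, b):
    """Cosine distance between two vectors (1.0 if either vector is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 1.0
    return 1.0 - float(a @ b) / denom


# Toy term-document matrix (assumed data): rows are terms, columns are documents.
# Documents 0 and 1 share one vocabulary, documents 2 and 3 share another.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

docs = lsi_reduce(A, k=2)
print(docs.shape)
print(cosine_distance(docs[0], docs[1]), cosine_distance(docs[0], docs[2]))
```

In this toy case the two documents that share vocabulary end up close together in the reduced space, while documents from different vocabularies stay far apart. A clustering stage would then operate on these low-dimensional vectors instead of the sparse term-document matrix.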
