High-Dimensional Text Datasets Clustering Algorithm Based on Cuckoo Search and Latent Semantic Indexing

Saida Ishak Boushaki; Nadjet Kamel; Omar Bendjeghaba


Abstract

Clustering is an important data analysis technique. However, clustering high-dimensional data such as documents takes more effort, because the rich, relevant information is hidden in a high-dimensional space. Recently, document clustering algorithms based on metaheuristics have proven efficient at exploring the search space and at reaching the global best solution rather than a local one. However, most of these algorithms are impractical and suffer from several limitations: they require the number of clusters to be known in advance, they are neither incremental nor extensible, and they index documents with a high-dimensional, sparse matrix. To overcome these limitations, this paper proposes a new dynamic and incremental approach (CS_LSI) for document clustering, based on the recent cuckoo search (CS) optimization and latent semantic indexing (LSI). Experiments conducted on four well-known high-dimensional text datasets show that the LSI model reduces the dimensionality of the space efficiently, with more precision and less computational time. In addition, the proposed CS_LSI determines the number of clusters automatically by employing a newly proposed index based on a significant distance measure. The latter is also used in the incremental mode and to detect outlier documents, which keeps the clusters more coherent. Furthermore, a comparison with conventional document clustering algorithms shows that CS_LSI achieves higher clustering quality.
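
The abstract does not give the details of CS_LSI, so the following is only a minimal sketch of generic cuckoo search (Lévy flights drawn with Mantegna's algorithm, plus abandonment of the worst nests) applied to centroid placement. It makes several simplifying assumptions that differ from the paper: the number of clusters `k` is fixed by the caller instead of being found by the proposed index, the points are toy 2-D vectors rather than LSI-reduced documents, and the objective is a plain sum of squared distances. All function names and parameter values (`n_nests`, `pa`, `alpha`, `iters`) are illustrative choices, not taken from the source.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step length using Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1) or 1e-12  # guard against an exact zero
    return u / abs(v) ** (1 / beta)


def sse(solution, points, k):
    """Sum of squared distances from each point to its nearest centroid."""
    d = len(points[0])
    centroids = [solution[i * d:(i + 1) * d] for i in range(k)]
    return sum(
        min(sum((p - c) * (p - c) for p, c in zip(pt, cen)) for cen in centroids)
        for pt in points
    )


def cuckoo_search(points, k, n_nests=15, pa=0.25, alpha=0.1, iters=200, seed=0):
    """Return (best_solution, best_fitness); a solution is k centroids flattened."""
    rng = random.Random(seed)

    def random_nest():
        # Assumption: initialise every centroid at a randomly chosen data point.
        return [x for _ in range(k) for x in rng.choice(points)]

    nests = [random_nest() for _ in range(n_nests)]
    fitness = [sse(n, points, k) for n in nests]

    for _ in range(iters):
        best = nests[min(range(n_nests), key=fitness.__getitem__)]
        # Levy-flight phase: each cuckoo proposes a candidate, which replaces
        # a randomly chosen nest only if it has a lower objective value.
        for i in range(n_nests):
            cand = [x + alpha * levy_step(rng) * (x - b)
                    for x, b in zip(nests[i], best)]
            f_cand = sse(cand, points, k)
            j = rng.randrange(n_nests)
            if f_cand < fitness[j]:
                nests[j], fitness[j] = cand, f_cand
        # Abandonment phase: rebuild the worst fraction pa of the nests.
        worst_first = sorted(range(n_nests), key=fitness.__getitem__, reverse=True)
        for j in worst_first[:int(pa * n_nests)]:
            nests[j] = random_nest()
            fitness[j] = sse(nests[j], points, k)

    i_best = min(range(n_nests), key=fitness.__getitem__)
    return nests[i_best], fitness[i_best]


if __name__ == "__main__":
    toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, value = cuckoo_search(toy, k=2)
    print(centroids, value)
```

Because a nest is only ever replaced by a better candidate, and only the worst nests are abandoned, the best objective value never gets worse from one iteration to the next. In a setting closer to the abstract, `points` would be document vectors after LSI reduction and the objective would be replaced by the paper's validity index, neither of which is specified here.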

Bibliographic details

Source
Journal of Information & Knowledge Management | 2018, Issue 3 | 24 pages
Authors
Saida Ishak Boushaki; Nadjet Kamel; Omar Bendjeghaba
Author affiliations

*LRIA, University of Science and Technology Houari Boumediene, Bab Ezzouar 16123, Algeria

*LRIA, University of Science and Technology Houari Boumediene, Bab Ezzouar 16123, Algeria

§LREEI, University M’Hamed Bougara Boumerdes, Boumerdes 35000, Algeria

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Information and communication theory
Keywords
Cuckoo search optimisation; high-dimensional text clustering; number of clusters; incremental clustering; internal validity index; latent semantic indexing; document clustering; vector space model; optimisation; metaheuristic;

