CPI-model-based analysis of sparse k-means clustering algorithms

Kazuo Aoyama; Kazumi Saito; Tetsuo Ikeda

Abstract

Standard k-means clustering algorithms have been widely used to partition a given data set into k disjoint subsets. When a data set is large-scale, high-dimensional, and sparse, such as text data with a bag-of-words representation, the choice of representations for the data set and the mean set is not trivial. Moreover, algorithms that differ only in their representations require different elapsed times to converge, even though they start from an identical initial state and execute an identical number of similarity calculations, which is a conventional indicator of speed performance. We design sparse k-means clustering algorithms that use distinct representations, each of which is a pair of a data structure and an expression. Our purpose is to clarify the cause of their performance differences and to identify the best algorithm when they are executed on a modern computer system. We analyze the algorithms with a simple yet practical clock-cycles-per-instruction (CPI) model, expressed as a linear combination of four performance degradation factors in a modern computer system: the completed instructions, the level-1 and last-level cache misses, and the branch mispredictions. We also optimize the model parameters with a newly introduced procedure and demonstrate that CPIs calculated with our model agree well with experimental results when the algorithms are applied to large-scale, high-dimensional real document data sets. Furthermore, our model clarifies that the best algorithm among them suppresses the performance degradation factors, namely the numbers of cache misses, branch mispredictions, and completed instructions.
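To make the linear CPI model described in the abstract concrete, the following is a minimal sketch in Python. The abstract only states that the model is a linear combination of four factors, so the exact form used here (total cycles as a weighted sum of event counts, divided by the instruction count) is an assumption, as are the names `Counters` and `model_cpi` and the weight values, which are illustrative and not taken from the paper. The paper's parameter-optimization procedure is not shown.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Counters:
    """Hardware-event counts for one run (hypothetical field names)."""
    instructions: int            # completed instructions
    l1_misses: int               # level-1 cache misses
    llc_misses: int              # last-level cache misses
    branch_mispredictions: int   # branch mispredictions


def model_cpi(c: Counters, w: Sequence[float]) -> float:
    """Estimate clock cycles per instruction from four event counts.

    Assumed form: cycles ~= w[0]*instructions + w[1]*l1_misses
                            + w[2]*llc_misses + w[3]*branch_mispredictions,
    and CPI = cycles / instructions.
    """
    cycles = (
        w[0] * c.instructions
        + w[1] * c.l1_misses
        + w[2] * c.llc_misses
        + w[3] * c.branch_mispredictions
    )
    return cycles / c.instructions


# Illustrative numbers only: made-up counts and per-event cycle costs.
run = Counters(
    instructions=1_000_000,
    l1_misses=20_000,
    llc_misses=1_000,
    branch_mispredictions=5_000,
)
weights = (0.5, 10.0, 100.0, 20.0)
print(f"model CPI = {model_cpi(run, weights):.3f}")
```

In this form, two algorithms that execute the same number of similarity calculations can still differ in CPI, and in elapsed time, if one of them incurs more cache misses or branch mispredictions per instruction.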

Bibliographic details

Source
International Journal of Data Science and Analytics, 2021, Issue 3, pp. 229-248 (20 pages)
Authors
Kazuo Aoyama; Kazumi Saito; Tetsuo Ikeda
Affiliations

NTT Communication Science Laboratories 2-4 Hikaridai Seika-cho Soraku-gun Kyoto 619-0237 Japan;

Kanagawa University 2946 Tsuchiya Hiratsuka-shi Kanagawa 259-1293 Japan;

University of Shizuoka 52-1 Yada Suruga-ku Shizuoka 422-8526 Japan;

Original format: PDF
Language: English
Keywords
Clustering; Algorithms; Performance analysis; Data structure; Sparse data; k-means;

Similar literature

1. A k-means based co-clustering (kCC) algorithm for sparse, high dimensional data [J]. Hussain Syed Fawad, Haris Muhammad. Expert Systems with Applications. 2019, Issue MARa
2. RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm [J]. Yumi Kondo, Matias Salibian-Barrera, Ruben Zamar. Journal of Statistical Software. 2016, Issue 1
3. An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data [J]. Jing Liping, Ng Michael K., Huang Joshua Zhexue. IEEE Transactions on Knowledge and Data Engineering. 2007, Issue 8
4. Sparse component analysis based on an improved ant K-means clustering algorithm for underdetermined blind source separation [C]. Shuang Wei, Feng Wang, Defu Jiang. IEEE International Conference on Networking, Sensing and Control. 2019
5. Clustering educational digital library usage data: Comparisons of latent class analysis and K-means algorithms [D]. Xu, Beijie. 2011
6. Does Determination of Initial Cluster Centroids Improve the Performance of K-Means Clustering Algorithm? Comparison of Three Hybrid Methods by Genetic Algorithm Minimum Spanning Tree and Hierarchical Clustering in an Applied Study [O]. Saeedeh Pourahmad, Atefeh Basirat, Amir Rahimi. 2020
7. RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm [O]. Yumi Kondo, Matias Salibian-Barrera, Ruben Zamar. 2016
