...
首页> 外文期刊>International Journal of Data Science and Analytics >CPI-model-based analysis of sparse k-means clustering algorithms
【24h】

CPI-model-based analysis of sparse k-means clustering algorithms

机译:基于CPI模型的稀疏k均值分析算法分析

获取原文
获取原文并翻译 | 示例
           

摘要

Standard k-means clustering algorithms have been widely used to solve the partitioning problems of a given data set into k disjoint subsets. When a data set is large-scale and high-dimensional sparse, such as text data with a bag-of-words representation, it is not trivial which representations are adopted for both the data and mean sets. Additionally, algorithms that differ only in their representations need distinct elapsed times until their convergences, despite starting at an identical initial state and executing an identical number of similarity calculations, which is a conventional indicator of speed performance. We design sparse k-means clustering algorithms that utilize distinct representations, each of which is a pair of a data structure and an expression. Our purpose is to clarify the cause of their performance differences and identify the best algorithm when they are executed in a modern computer system. We analyze the algorithms with a simple yet practical clock-cycle per instruction (CPI) model that is expressed as a linear combination of four performance degradation factors in a modern computer system: the completed instructions, the level-1 and last-level cache misses, and the branch mispredictions. We also optimize the model parameters by a newly introduced procedure and demonstrate that CPIs calculated with our model agree well with experimental results when the algorithms are applied to large-scale and high-dimensional real document data sets. Furthermore, our model clarifies that the best algorithm among them suppresses the performance degradation factors of the number of cache misses, the branch mispredictions, and the completed instructions.
机译:标准K-means聚类算法已被广泛用于解决给定数据集的划分问题设置为k个不相交子集。当数据集是大规模和高维稀疏时,例如具有单词袋式表示的文本数据时,它并不易于为数据和均值集采用该表示。另外,尽管以相同的初始状态启动并执行相同数量的相似性计算,但在它们的陈述中仅不同的算法需要不同的经过时间,直到它们的收敛,并且执行相同数量的相似性计算,这是速度性能的传统指示符。我们设计利用不同表示的稀疏k-means聚类算法,每个表示是数据结构和表达式。我们的目的是澄清其性能差异的原因,并在现代计算机系统中执行时识别最佳算法。我们通过每个指令(CPI)模型的简单但实用的时钟周期分析了算法,该模型表示为现代计算机系统中的四种性能劣化因子的线性组合:完成的指令,1级和最后一级缓存未命中和分支错误预测。我们还通过新介绍的程序优化了模型参数,并证明了当算法应用于大规模和高维实物数据集时,通过我们的模型计算的CPI与实验结果很好。此外,我们的模型澄清了它们中最好的算法抑制了缓存未命中的数量,分支错误预测和完成指令的性能劣化因素。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号