A highly efficient multi-core algorithm for clustering extremely large datasets

Johann M Kraus; Hans A Kestler




Abstract

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely largely on network communication protocols that connect, and therefore require, multiple computers. One answer to this problem is to use the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis, which employs repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm increased by a factor of 10 for large data sets, while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.

Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
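To make the idea of distributing k-means over the cores of one machine concrete, here is a minimal sketch in Java. It is not the authors' implementation: the abstract describes a design based on transactional memory, whereas this sketch swaps in plain Java parallel streams, which also need no locks here because each point (in the assignment step) and each centroid (in the update step) is computed independently. The class name `ParallelKMeansSketch`, its method names, and the toy data are all hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration only: multi-core k-means via parallel streams,
// NOT the transactional-memory design described in the abstract.
public class ParallelKMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Assignment step: points are independent, so they are processed in parallel. */
    static int[] assign(double[][] data, double[][] centroids) {
        return IntStream.range(0, data.length).parallel().map(i -> {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (squaredDistance(data[i], centroids[c])
                        < squaredDistance(data[i], centroids[best])) {
                    best = c;
                }
            }
            return best;
        }).toArray();
    }

    /** Update step: each centroid is recomputed in parallel; empty clusters keep their old centroid. */
    static double[][] update(double[][] data, int[] labels, double[][] old) {
        int dim = data[0].length;
        return IntStream.range(0, old.length).parallel().mapToObj(c -> {
            double[] sum = new double[dim];
            int count = 0;
            for (int i = 0; i < data.length; i++) {
                if (labels[i] == c) {
                    for (int j = 0; j < dim; j++) sum[j] += data[i][j];
                    count++;
                }
            }
            if (count == 0) return old[c].clone();
            for (int j = 0; j < dim; j++) sum[j] /= count;
            return sum;
        }).toArray(double[][]::new);
    }

    /** Repeats update and assignment until labels stop changing or maxIter is reached. */
    static int[] cluster(double[][] data, double[][] initialCentroids, int maxIter) {
        double[][] centroids = initialCentroids;
        int[] labels = assign(data, centroids);
        for (int iter = 0; iter < maxIter; iter++) {
            centroids = update(data, labels, centroids);
            int[] next = assign(data, centroids);
            if (Arrays.equals(next, labels)) break;
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy data: two well-separated groups of two points each.
        double[][] data = { {0.0, 0.1}, {0.2, 0.0}, {5.0, 5.1}, {5.2, 4.9} };
        double[][] init = { data[0], data[2] };
        System.out.println(Arrays.toString(cluster(data, init, 100))); // prints [0, 0, 1, 1]
    }
}
```

A stability or sensitivity analysis of the kind mentioned in the abstract would call `cluster` repeatedly with different initial centroids or cluster counts; those repeated runs are themselves independent and could be parallelized in the same way.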


Bibliographic details

Source: BMC Bioinformatics, 2010, Issue 1
Authors: Johann M Kraus; Hans A Kestler
Original format: PDF
Chinese Library Classification: Biological sciences
Similar literature

1. A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets [J]. Sharma A, Podolsky R, Zhao J, et al. Bioinformatics. 2009, Issue 9.
2. A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets [J]. Ashok Sharma, Robert Podolsky, Jieping Zhao and Richard A. McIndoe. Bioinformatics. 2009, Issue 9.
3. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space [J]. Loewenstein Y, Portugaly E, Fromer M, et al. Bioinformatics. 2008, Issue 13.
4. GABoost: A Clustering Based Undersampling Algorithm for Highly Imbalanced Datasets Using Genetic Algorithm [C]. O. A. Ajilisa, V. P. Jagathyraj, M. K. Sabu. International Conference on Innovations in Bio-Inspired Computing and Applications. 2019.
5. Supervised precision ordinal clustering – A human-machine learning algorithm to create accurate clusters in big datasets: Application to Indiana water quality data with novel visualization techniques [D]. Singh, Sarabjit. 2014.
6. A highly efficient multi-core algorithm for clustering extremely large datasets [O]. Johann M Kraus, Hans A Kestler. 2010.
7. Evaluation of Hierarchical Clustering Algorithms for Document Datasets [R]. Zhao, Y., Karypis, G. 2002.
