BMC Bioinformatics

A highly efficient multi-core algorithm for clustering extremely large datasets

Abstract

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely largely on network communication protocols and therefore require multiple connected computers. One answer to this problem is to use the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis, which employ repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm increased by a factor of 10 for large data sets, while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.

Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
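To make the idea of distributing k-means across the cores of one machine concrete, here is a minimal Java sketch. It is not the authors' implementation: the abstract describes a design based on transactional memory principles, whereas this sketch swaps in Java parallel streams over read-only inputs, so no locking or transactions are needed. The class name `ParallelKMeansSketch` and all method names are hypothetical, and the initial centroids are assumed to be supplied by the caller.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative sketch only: shared-memory k-means using parallel streams,
// NOT the transactional-memory design described in the abstract.
public class ParallelKMeansSketch {

    /** Assignment step: label every point with its nearest centroid, in parallel over points. */
    public static int[] assign(double[][] points, double[][] centroids) {
        // Each point is handled independently and the inputs are only read,
        // so the work can be split across cores without synchronization.
        return IntStream.range(0, points.length)
                .parallel()
                .map(i -> nearest(points[i], centroids))
                .toArray();
    }

    /** Index of the centroid with the smallest squared Euclidean distance to p. */
    public static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Update step: recompute each centroid as the mean of its points, in parallel over clusters. */
    public static double[][] update(double[][] points, int[] labels, double[][] old) {
        int k = old.length;
        int dim = old[0].length;
        return IntStream.range(0, k).parallel().mapToObj(c -> {
            double[] sum = new double[dim];
            int count = 0;
            for (int i = 0; i < points.length; i++) {
                if (labels[i] != c) continue;
                count++;
                for (int j = 0; j < dim; j++) sum[j] += points[i][j];
            }
            if (count == 0) return old[c].clone(); // empty cluster: keep the previous centroid
            for (int j = 0; j < dim; j++) sum[j] /= count;
            return sum;
        }).toArray(double[][]::new);
    }

    /** Alternate update and assignment until the labels stop changing or maxIter is reached. */
    public static int[] cluster(double[][] points, double[][] initial, int maxIter) {
        double[][] centroids = initial;
        int[] labels = assign(points, centroids);
        for (int it = 0; it < maxIter; it++) {
            centroids = update(points, labels, centroids);
            int[] next = assign(points, centroids);
            if (Arrays.equals(next, labels)) break; // converged
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, initial, 10))); // prints [0, 0, 1, 1]
    }
}
```

The assignment step dominates the running time for large data sets and is trivially parallel over points, which is why it is the natural place to use all cores. A k-modes variant for categorical data would, under the same assumptions, replace the squared distance with a count of mismatching attributes and the mean with a per-attribute mode.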
