Journal: Bioinformatics

A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets


Abstract

MOTIVATION: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. Algorithms are needed that are fast and can cluster extremely large datasets without degrading cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm that partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, and the second stage is a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results of the two-stage hyperplane algorithm with those of the conventional k-means algorithm in other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage reduces the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to completion compared with popular k-means programs. The software was written in C# (.NET 1.1).

AVAILABILITY: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
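
The two-stage idea in the abstract (a single-scan data reduction followed by conventional k-means) can be illustrated with a minimal Python sketch. This is not the HPCluster implementation, which was written in C# and is not reproduced here. It assumes a simplified first stage that keeps a flat list of BIRCH-style summaries (count and linear sum) instead of a full clustering-feature tree, and it does not model the hyperplane partitioning itself. The function names, the `threshold` parameter, and the weighting of summaries by their counts in the second stage are all illustrative assumptions.

```python
import math
import random


def reduce_single_scan(rows, threshold):
    """Stage 1 (assumed, simplified): one pass over the rows.

    Each summary is [count, linear_sum]. A row joins the nearest summary
    whose centroid lies within `threshold`; otherwise it starts a new one.
    """
    summaries = []
    for row in rows:
        best, best_dist = None, threshold
        for summary in summaries:
            centroid = [v / summary[0] for v in summary[1]]
            d = math.dist(row, centroid)
            if d <= best_dist:
                best, best_dist = summary, d
        if best is None:
            summaries.append([1, list(row)])
        else:
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], row)]
    return summaries


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def weighted_kmeans(summaries, k, iterations=50, seed=0):
    """Stage 2: plain k-means on summary centroids, weighted by their counts."""
    points = [[v / n for v in linear_sum] for n, linear_sum in summaries]
    weights = [n for n, _ in summaries]
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [(p, w) for p, w, a in zip(points, weights, labels) if a == j]
            if not members:
                new_centers.append(centers[j])  # keep the center of an empty cluster
                continue
            total = sum(w for _, w in members)
            dim = len(members[0][0])
            new_centers.append(
                [sum(p[i] * w for p, w in members) / total for i in range(dim)]
            )
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels
```

Calling `reduce_single_scan(rows, threshold)` on a list of expression profiles and passing the result to `weighted_kmeans(summaries, k)` runs k-means over the summaries rather than over every gene. In this sketch that is where the memory and time savings would come from: the second stage scales with the number of summaries, not with the number of rows.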
