A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets

Sharma A; Podolsky R; Zhao J; McIndoe RA

Abstract

MOTIVATION: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. Algorithms are needed that are fast and can cluster extremely large datasets without degrading cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, and the second stage is a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results of the two-stage hyperplane algorithm with those of the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to completion compared with popular k-means programs. The software was written in C# (.NET 1.1).

AVAILABILITY: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
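
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the HPCluster implementation (which the abstract says was written in C#), and it replaces the BIRCH tree and the hyperplane partitioning with a flat list of running summaries. The function names, the `threshold` parameter and the toy data are assumptions made for illustration only.

```python
import math
import random


def reduce_single_pass(rows, threshold):
    """Stage 1 (simplified): scan the rows once. Fold each row into the
    nearest summary (count, linear sum) when that summary's centroid lies
    within `threshold`; otherwise start a new summary."""
    summaries = []  # each entry: [count, linear_sum_vector]
    for row in rows:
        best, best_dist = None, float("inf")
        for s in summaries:
            centroid = [v / s[0] for v in s[1]]
            d = math.dist(row, centroid)
            if d < best_dist:
                best, best_dist = s, d
        if best is not None and best_dist <= threshold:
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], row)]
        else:
            summaries.append([1, list(row)])
    return summaries


def weighted_kmeans(summaries, k, iterations=20, seed=0):
    """Stage 2: conventional k-means on the summary centroids, with each
    centroid weighted by the number of rows it absorbed."""
    points = [[v / n for v in total] for n, total in summaries]
    weights = [n for n, _ in summaries]
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # requires k <= len(points)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every summary centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: weighted mean of the members of each cluster.
        for c in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == c]
            if members:
                total_w = sum(w for _, w in members)
                centers[c] = [sum(p[i] * w for p, w in members) / total_w
                              for i in range(len(centers[c]))]
    return centers, labels


# Toy expression profiles (hypothetical values): two well-separated groups.
rows = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
        (5.0, 5.1), (5.1, 5.0), (4.95, 5.05)]
summaries = reduce_single_pass(rows, threshold=1.0)
centers, labels = weighted_kmeans(summaries, k=2)
print(len(rows), "rows reduced to", len(summaries), "summaries")
```

In this sketch the second stage only ever sees the summaries, so its cost depends on how many summaries the first pass produces rather than on the number of input rows. That mirrors, in a simplified way, the data reduction the abstract credits for the lower memory use and shorter run time.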

Bibliographic record

Source
Bioinformatics | 2009, Issue 9 | 6 pages
Authors
Sharma A; Podolsky R; Zhao J; McIndoe RA
Original format: PDF
Language: English
Chinese Library Classification: Bioengineering (biotechnology)

Similar literature

1. A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets [J]. Sharma A, Podolsky R, Zhao J, Bioinformatics. 2009, Issue 9
2. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. [J]. Loewenstein Y, Portugaly E, Fromer M, Bioinformatics. 2008, Issue 13
3. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space [J]. Yaniv Loewenstein, Elon Portugaly, Menachem Fromer, Bioinformatics. 2008, Issue 13
4. A Modified Relationship Based Clustering Framework for Density Based Clustering and Outlier Filtering on High Dimensional Datasets [C]. Turgay Tugay Bilgin, A. Yilmaz Camurcu. Advances in Knowledge Discovery and Data Mining; Lecture Notes in Artificial Intelligence; 4426. 2007
5. Supervised precision ordinal clustering – A human-machine learning algorithm to create accurate clusters in big datasets: Application to Indiana water quality data with novel visualization techniques [D]. Singh, Sarabjit. 2014
6. A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets [O]. Ashok Sharma, Robert Podolsky, Jieping Zhao, -1
7. A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets [O]. Sharma, Ashok, Podolsky, Robert, Zhao, Jieping, 2009
8. Evaluation of Hierarchical Clustering Algorithms for Document Datasets. [R]. Zhao, Y., Karypis, G. 2002
