With the rapid development of bioinformatics technology, the volume of gene expression data is growing sharply, which poses a serious challenge for traditional gene expression clustering algorithms. Density-based hierarchical clustering (DHC) handles the nested-cluster problem in gene expression data well and is robust, but it is inefficient on very large data sets. To address this, a density-based hierarchical clustering algorithm on MapReduce, DisDHC, is proposed. The algorithm first partitions the data set into smaller blocks and clusters each block with DHC in parallel, which yields a sparsified version of the data. It then gathers these results and applies DHC again, and finally produces the density centers for the whole data set. Experiments on the yeast (GAL) data set, the yeast cell cycle data set, and the human serum data set show that DisDHC greatly reduces clustering time while preserving the clustering quality of DHC.
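The partition, per-block clustering, and re-clustering steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `density_centers` is a deliberately simple stand-in (local density maxima within a fixed `radius`) rather than the actual DHC algorithm, the per-block loop runs sequentially in place of a real MapReduce map phase, and all function names and parameters here are hypothetical.

```python
from math import dist


def partition(points, n_blocks):
    """Split the data into n_blocks interleaved blocks of roughly equal size."""
    return [points[i::n_blocks] for i in range(n_blocks)]


def density_centers(points, radius):
    """Stand-in for DHC (assumption): keep points that are local density maxima.

    The density of a point is the number of points within `radius` of it
    (itself included). A point is kept if no neighbour within `radius` has a
    higher density; ties go to the point that appears first in the list.
    """
    density = [sum(dist(p, q) <= radius for q in points) for p in points]
    centers = []
    for i, p in enumerate(points):
        is_peak = all(
            (density[i], -i) >= (density[j], -j)
            for j, q in enumerate(points)
            if dist(p, q) <= radius
        )
        if is_peak:
            centers.append(p)
    return centers


def dis_dhc_sketch(points, n_blocks, radius):
    """Two-stage clustering: per-block centers first, then re-clustering."""
    blocks = partition(points, n_blocks)
    # "Map" step: each block is reduced to a sparse set of local centers.
    local = [density_centers(block, radius) for block in blocks]
    # "Reduce" step: gather the local centers and cluster them once more.
    merged = [c for block_centers in local for c in block_centers]
    return density_centers(merged, radius)


if __name__ == "__main__":
    # Two small, well-separated groups of 2-D points (toy data, not gene data).
    group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (0.1, 0.1)]
    group_b = [(10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (9.9, 10.0), (10.0, 9.9), (10.1, 10.1)]
    print(dis_dhc_sketch(group_a + group_b, n_blocks=2, radius=1.0))
```

The point of the two-stage design is that the second pass only sees the few centers returned by each block, so its cost depends on the number of local centers rather than on the size of the full data set.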