Journal: Bioinformatics

Comparisons and validation of statistical clustering techniques for microarray gene expression data



Abstract

Motivation: With the advent of microarray chip technology, large data sets are emerging that contain the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While hierarchical clustering (UPGMA) with correlation 'distance' has been the most common choice in microarray studies, many more clustering algorithms are available in the pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm for grouping genes based on their expression profiles.

Results: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performance on a well-known, publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of the six clustering methods with these validation measures. While the 'best' method depends on the exact validation strategy and the number of clusters used, overall Diana appears to be a solid performer. Interestingly, correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to perform at opposite extremes, depending on which validation measure one employs. Next, it is shown that the group means produced by Diana are the closest to, and those produced by UPGMA the farthest from, a model profile based on a set of hand-picked genes.

Availability: S+ code for the partial least squares based clustering is available from the authors upon request. All other clustering methods considered have S+ implementations in the library MASS. S+ code for calculating the validation measures is available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation.
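
For readers unfamiliar with the baseline method named in the abstract, the following is a minimal sketch of average-linkage hierarchical clustering (UPGMA) using a correlation 'distance' of 1 minus the Pearson correlation. It is written in plain Python for illustration and is not the authors' S+ code. The function names (`pearson`, `correlation_distance`, `upgma`), the choice of 1 - r as the distance, and the toy profiles are assumptions made here, not details taken from the paper.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_distance(x, y):
    """Assumed dissimilarity: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


def upgma(profiles, n_clusters):
    """Merge clusters by average linkage until n_clusters remain.

    Returns the clusters as sorted lists of row indices into `profiles`.
    """
    n = len(profiles)
    # Pairwise distances between individual genes, keyed by (smaller, larger) index.
    dist = {
        (i, j): correlation_distance(profiles[i], profiles[j])
        for i, j in combinations(range(n), 2)
    }

    def linkage(a, b):
        # UPGMA: unweighted mean of all between-cluster pairwise distances.
        return mean(dist[min(i, j), max(i, j)] for i in a for j in b)

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        # Find the pair of current clusters with the smallest average distance.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical toy data: two rising and two falling temporal profiles.
    toy_profiles = [
        [0.1, 0.5, 1.0, 1.6],
        [0.0, 0.4, 1.1, 1.5],
        [1.5, 1.0, 0.4, 0.1],
        [1.6, 1.1, 0.5, 0.0],
    ]
    print(upgma(toy_profiles, 2))
```

Because the distance depends only on correlation, profiles with the same temporal shape group together regardless of their absolute expression level. This brute-force version recomputes linkages at every merge and is only practical for small examples.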
