Journal: Bioinformatics

Model-based clustering and data transformations for gene expression data.



Abstract

MOTIVATION: Clustering is a useful exploratory technique for the analysis of gene expression data, and many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data are generated by a finite mixture of underlying probability distributions, such as multivariate normal distributions. Within this probability framework, the issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.

RESULTS: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption by assessing how well these real gene expression data sets fit multivariate Gaussian distributions both before and after commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits.

AVAILABILITY: MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development.

CONTACT: kayee
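The idea of treating the number of clusters as a model selection problem can be illustrated with a small sketch. This is not the MCLUST implementation and not the authors' code: it is a deliberately simplified, one-dimensional Gaussian mixture fitted by EM in plain Python, scored with the Bayesian information criterion (BIC) as an assumed selection criterion, since the abstract does not name one. All function names (`fit_gmm_1d`, `bic`, `select_k`, `make_demo_data`) and the synthetic data are hypothetical and introduced only for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    """Log-likelihood of the data under a 1-D Gaussian mixture."""
    total = 0.0
    for x in xs:
        dens = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(max(dens, 1e-300))  # floor guards against log(0)
    return total


def fit_gmm_1d(xs, k, iters=200, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative, not MCLUST)."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared overall variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    overall_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, var_floor)  # floor avoids collapsed components
    return weights, means, variances


def bic(xs, k, params):
    """BIC in the 'lower is better' convention: -2 log L + p log n."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    return -2.0 * log_likelihood(xs, *params) + n_params * math.log(len(xs))


def select_k(xs, max_k=4):
    """Fit mixtures with 1..max_k components and return the k with the lowest BIC."""
    fits, scores = {}, {}
    for k in range(1, max_k + 1):
        fits[k] = fit_gmm_1d(xs, k)
        scores[k] = bic(xs, k, fits[k])
    best_k = min(scores, key=scores.get)
    return best_k, scores, fits


def make_demo_data(seed=0):
    """Synthetic data (an assumption for this sketch): two separated normal clusters."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(6.0, 1.0) for _ in range(150)]
```

Calling `select_k(make_demo_data())` returns the chosen number of components together with the BIC score and fitted parameters for each candidate. Real expression data are multivariate, and the "model" being selected would also include the covariance structure (for example, the diagonal model mentioned above); the sketch leaves that out to stay short.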
