A new normalized EM algorithm for clustering gene expression data

Abstract

Microarray data clustering is a basic exploratory tool for finding groups of genes that exhibit similar expression patterns or for detecting relevant classes of molecular subtypes. Among the wide range of clustering approaches proposed and applied in the gene expression community to analyze microarray data, mixture model-based clustering has received much attention owing to its sound statistical framework and its flexibility in data modeling. However, clustering algorithms that follow the model-based framework suffer from two serious drawbacks. First, their performance depends critically on the starting values of their iterative clustering procedures. Second, they cannot work directly with very high-dimensional data sets in the sample clustering problem, where the dimension of the data reaches hundreds or thousands. The thesis focuses on these two challenges and makes the following contributions.

First, the thesis introduces the statistical model of the proposed normalized Expectation Maximization (EM) algorithm, followed by an analysis of its clustering performance on a number of real microarray data sets. The normalized EM is stable even with random initializations of its EM iterative procedure; this stability is demonstrated through a performance comparison with other related clustering algorithms. Furthermore, the normalized EM is the first mixture model-based clustering approach capable of working directly with very high-dimensional microarray data sets in the sample clustering problem, where the number of genes is much larger than the number of samples. This advantage is illustrated through a comparison with the unnormalized EM (the conventional EM algorithm for Gaussian mixture model-based clustering). In addition, for experimental microarray data sets whose data points have class labels available, an interesting property is uncovered concerning the convergence speed of the normalized EM with respect to the radius of the hypersphere in its corresponding statistical model.

Second, to support the performance comparison of different clusterings, a new internal index is derived using fundamental concepts from information theory. This index allows the comparison of clustering approaches in which the closeness between data points is evaluated by their cosine similarity. The method used to derive this internal index can also be applied to design new indexes for comparing clustering approaches that share a common similarity measure.
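The abstract refers to data lying on a hypersphere of some radius and to closeness measured by cosine similarity. The following is a minimal NumPy sketch of those two ideas only, not the thesis's normalized EM algorithm, whose details the abstract does not give. The function names `project_to_hypersphere` and `cosine_similarity_matrix`, the random data, and the radius value are all hypothetical choices made for illustration.

```python
import numpy as np

def project_to_hypersphere(X, radius=1.0):
    """Rescale each row (sample) of X so its Euclidean norm equals `radius`.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("cannot normalize a zero vector")
    return radius * X / norms

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    U = project_to_hypersphere(X, radius=1.0)
    return U @ U.T

# Assumed toy data: 5 samples with 1000 features each, mimicking the
# "many more genes than samples" setting described in the abstract.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1000))

S = cosine_similarity_matrix(X)             # shape (5, 5)
Z = project_to_hypersphere(X, radius=3.0)   # every row now has norm 3
```

Under these assumptions, cosine similarity does not depend on the chosen radius, while the inner products of the projected samples scale with the squared radius (`Z @ Z.T` equals `9.0 * S` here).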