A machine learning approach to query time-series microarray data sets for functionally related genes using hidden Markov models.


Abstract

Microarray technology captures the rate of expression of genes under varying experimental conditions. Genes encode the information necessary to build proteins; when a cellular function uses a protein, the associated gene exhibits a higher rate of expression. If multiple proteins are required for a particular function, their genes show a pattern of coexpression during the time periods when that function is active within a cell. Cellular functions are generally complex and require groups of genes to cooperate; these groups of genes are called functional modules. Modular organization of genetic functions has been evident since 1999. Detecting functionally related genes in a genome and detecting all genes belonging to particular functional modules are current research topics in this field.

The number of microarray gene expression data sets available in public repositories is increasing rapidly, and advances in technology have made it feasible to routinely perform whole-genome studies in which the behavior of every gene in a genome is captured. This promises a wealth of biological and medical information, but making this amount of data accessible to researchers requires intelligent and efficient computational algorithms. Researchers working on specific cellular functions would benefit from this data if it were possible to quickly extract the information relevant to their area of research.

This dissertation develops a machine learning algorithm that allows one or multiple microarray data sets to be queried with a set of known, functionally related input genes in order to detect additional genes participating in the same or closely related functions. The focus is on time-series microarray data sets, where gene expression values are obtained from the same experiment through a series of sequential measurements over a period of time. A feature selection algorithm selects the relevant time steps, those where the provided input genes exhibit correlated expression behavior. Time steps are the columns of a microarray data set; rows list individual genes. A specific linear Hidden Markov Model (HMM) is then constructed to contain one hidden state for each of the selected experiments (time steps) and is trained using the expression values of the input genes from the microarray.

Given the trained HMM, the probability that a sequence of gene expression values was generated by that particular HMM can be calculated. This allows a probability score to be assigned to each gene in the microarray. High-scoring genes are included in the result set of genes with functional similarities to the input genes. P-values can be calculated by repeating this algorithm to train multiple individual HMMs using randomly selected genes as input genes and calculating a Parzen Density Function (PDF) from the probability scores of all HMMs for each gene.

A feedback loop uses the result generated by one run of the algorithm as the input set for another iteration. This iterated HMM algorithm allows functional modules to be characterized from very small input sets and from weak similarity signals.

The algorithm also allows for the integration of multiple microarray data sets. Two approaches are studied: Meta-Analysis (combining the results from runs on individual data sets) and the extension of the linear HMM across multiple individual data sets. Results indicate that Meta-Analysis works best for integrating closely related microarrays, while a spanning HMM works best for integrating multiple heterogeneous data sets.

The performance of this approach is demonstrated relative to the published literature on a number of widely used synthetic data sets. Biological application is verified by analyzing biological data sets of the fruit fly D. melanogaster and baker's yeast S. cerevisiae. The algorithm developed in this dissertation is better able to detect functionally related genes in common data sets than the algorithms currently available in the published literature.
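
The query workflow described in the abstract (select time steps, fit a linear model on the input genes, score every gene) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. It assumes that "correlated expression behavior" can be approximated by low variance across the input genes at a time step, and that the linear HMM has deterministic left-to-right transitions with one Gaussian emission per state, in which case the sequence likelihood reduces to a product of per-state densities. All function names, the `min_std` floor, and the synthetic data are hypothetical. The p-value estimation and the feedback loop are not covered.

```python
import numpy as np

def select_time_steps(expr, input_idx, k):
    """Pick the k columns (time steps) where the input genes agree most.

    Assumption: agreement is measured as low variance across the input
    genes; the dissertation's actual selection criterion may differ.
    """
    spread = expr[input_idx].var(axis=0)
    return np.sort(np.argsort(spread)[:k])

def train_linear_model(expr, input_idx, steps, min_std=0.1):
    """Fit one Gaussian emission (mean, std) per selected time step.

    Assumption: in a strictly linear chain, state t emits observation t,
    so training reduces to independent per-state fits.
    """
    obs = expr[np.ix_(input_idx, steps)]
    return obs.mean(axis=0), np.maximum(obs.std(axis=0), min_std)

def log_scores(expr, steps, mean, std):
    """Log-likelihood of each gene's selected values under the model."""
    z = (expr[:, steps] - mean) / std
    per_state = -0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)
    return per_state.sum(axis=1)

def query(expr, input_idx, k, top_n):
    """Rank all genes by similarity to the input genes."""
    steps = select_time_steps(expr, input_idx, k)
    mean, std = train_linear_model(expr, input_idx, steps)
    scores = log_scores(expr, steps, mean, std)
    ranked = np.argsort(scores)[::-1]  # highest score first
    return ranked[:top_n], scores, steps

# Synthetic data: 50 genes (rows) by 12 time steps (columns).
# Genes 0-14 form a hypothetical module that shares a pattern in columns 3-8.
rng = np.random.default_rng(0)
expr = rng.normal(0.0, 1.0, size=(50, 12))
pattern = np.array([3.0, -3.0, 3.0, -3.0, 3.0, -3.0])
expr[:15, 3:9] = pattern + rng.normal(0.0, 0.1, size=(15, 6))

input_genes = np.arange(8)  # known members used as the query
top, scores, steps = query(expr, input_genes, k=6, top_n=15)
print("selected time steps:", steps)
print("top-ranked genes:", np.sort(top))
```

Here the first eight module genes serve as the query, and the remaining module genes are expected to appear among the highest-scoring rows. Working in log space avoids numerical underflow when many time steps are multiplied together.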

Bibliographic record

  • Author: Senf, Alexander
  • Author affiliation: University of Kansas
  • Degree-granting institution: University of Kansas
  • Subject: Biology, Bioinformatics; Computer Science
  • Degree: Ph.D.
  • Year: 2011
  • Pages: 132
  • Original format: PDF
  • Language: English
