Protein sequence classification with Bayesian supervised and semi-supervised learned classifiers.

Abstract

Bioinformatics, an interdisciplinary field between computer science and biology, has emerged primarily out of the need for methods to automate the analysis and annotation of newly discovered biological data. In the last decade, rapid advancements in high-throughput experimental techniques have driven exponential growth in the size of gene and protein sequence repositories. This has resulted in a huge inventory of genes with unknown function. Gene function is executed primarily at the protein level; hence, understanding the functional role of proteins in a species can yield substantial biological information about that species, which may have applications in biomedical research. Unfortunately, experimental characterization of protein function is tedious and not feasible for all genes. Computational methods can instead complement experimental efforts in annotating the vast amount of data that lacks functional and/or structural characterization.

The computational fields of machine learning and data mining continue to provide a much-needed framework for numerous methods that can assist in categorizing these unannotated proteins. Supervised learning methods have played a dominant role in helping us better understand some newly discovered proteins with respect to their functional and/or structural characterizations. As with all supervised learning methods, all training data used for model induction must be labeled.

Semi-supervised learning methods, which learn from both labeled and unlabeled data, can have a significant impact on this field of research because of the relatively large amount of unlabeled data available. In theory, this unlabeled data represents a large pool of untapped information that can be used to improve models based on labeled data. Unlike supervised methods, semi-supervised methods are only beginning to emerge in this field, and are often developed with restrictive requirements that make them unsuitable for analysis and characterization of large-scale sequence data.

This dissertation explores the development of supervised and semi-supervised learned classification methods designed to classify large-scale protein sequence data. A central aim of this research is to develop methods that require only the protein sequence for classification. The majority of the work is based on the well-known Naive Bayes classification framework, which has been shown to perform well in the field of text classification. The parameterized, probabilistic model is built by observing occurrences of fixed-length subsequences throughout the labeled data. The method is then extended with the Expectation-Maximization algorithm so that unlabeled data can be used to improve the model.

On the task of predicting the subcellular localization of a protein sequence, the supervised method outperforms existing methods. Moreover, the subcellular proteomes of numerous eukaryotic and prokaryotic species are estimated with far greater coverage than any other method known at the time of this research. Results from the semi-supervised learning research show that large repositories of unlabeled protein sequence data can indeed be used to improve predictive performance, particularly in situations where fewer labeled protein sequences are available and/or the data are highly unbalanced. This dissertation lays a foundation for exploring numerous other characterizations of proteins on large-scale data.
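
The approach summarized in the abstract (Naive Bayes over fixed-length subsequences, extended with Expectation-Maximization for unlabeled data) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names (`kmer_counts`, `fit`, `posterior`, `train`, `classify`), the 20-letter alphabet, the add-alpha smoothing, the fixed number of EM iterations, and the toy sequences and class labels are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def kmer_counts(seq, k):
    """Count the overlapping length-k subsequences (k-mers) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fit(counts, weights, classes, k, alpha=1.0):
    """M-step: estimate smoothed parameters from (possibly fractional) class weights."""
    prior = {c: alpha for c in classes}
    totals = {c: Counter() for c in classes}
    for cnt, w in zip(counts, weights):
        for c, wc in w.items():
            prior[c] += wc
            for kmer, n in cnt.items():
                totals[c][kmer] += wc * n
    z = sum(prior.values())
    vocab = len(AMINO_ACIDS) ** k  # number of possible k-mers (assumption)
    return {
        "k": k,
        "alpha": alpha,
        "log_prior": {c: math.log(prior[c] / z) for c in classes},
        "totals": totals,
        "denom": {c: sum(totals[c].values()) + alpha * vocab for c in classes},
    }


def posterior(model, cnt):
    """E-step for one sequence: P(class | k-mer counts) under the current model."""
    scores = {}
    for c, lp in model["log_prior"].items():
        tot, denom = model["totals"][c], model["denom"][c]
        scores[c] = lp + sum(
            n * math.log((tot[kmer] + model["alpha"]) / denom)
            for kmer, n in cnt.items()
        )
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


def train(labeled, unlabeled=(), k=3, em_iters=10, alpha=1.0):
    """Fit on labeled data, then refine with EM if unlabeled sequences are given."""
    classes = sorted({y for _, y in labeled})
    lab_counts = [kmer_counts(s, k) for s, _ in labeled]
    lab_weights = [{y: 1.0} for _, y in labeled]  # labeled data keep hard labels
    unl_counts = [kmer_counts(s, k) for s in unlabeled]
    model = fit(lab_counts, lab_weights, classes, k, alpha)
    if unl_counts:
        for _ in range(em_iters):  # fixed iteration count, no convergence test
            soft = [posterior(model, cnt) for cnt in unl_counts]
            model = fit(lab_counts + unl_counts, lab_weights + soft,
                        classes, k, alpha)
    return model


def classify(model, seq):
    """Return the most probable class for a sequence."""
    post = posterior(model, kmer_counts(seq, model["k"]))
    return max(post, key=post.get)


if __name__ == "__main__":
    # Toy, made-up sequences; real inputs would be full protein sequences.
    labeled = [("KKKKKK", "nucleus"), ("LLLLLL", "membrane")]
    unlabeled = ["KKKKRK", "LLLLAL"]
    model = train(labeled, unlabeled, k=2, em_iters=5)
    print(classify(model, "RKRK"))
```

In this toy setup the k-mers `RK` and `KR` never occur in the labeled data, so the purely supervised model cannot separate the classes for `RKRK`. After EM, the unlabeled sequence `KKKKRK` is softly assigned to the lysine-rich class and contributes those k-mers to it, which is the kind of gain from unlabeled data the abstract describes.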

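Bibliographic record

  • Author

    King, Brian R.

  • Author affiliation

    State University of New York at Albany

  • Degree-granting institution: State University of New York at Albany
  • Subject: Biology, Bioinformatics; Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 223 p.
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Automation technology, computer technology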
