Classification and knowledge discovery in protein databases.



Abstract

One of the major objectives of bioinformatics in the post-genomic era is the automated characterization of the large number of available protein sequences. The ultimate goal of such characterization is a detailed understanding of protein function and its complex network of interactions with other molecules in biochemical pathways. In this study we addressed several issues frequently encountered in classification and knowledge discovery in protein databases, and took a further step in the characterization and prediction of intrinsically disordered proteins.

First, we concentrated on the problem of classification in noisy, high-dimensional, sparse, and class-imbalanced datasets. Restricting ourselves to the two-class classification framework, we emphasized cases where one class (the positive or minority class) is underrepresented and small, while the other class (the negative or majority class) is arbitrarily large. We designed a complete classification system that includes a permutation-test-based feature selection filter and then combines over-sampling of the minority class, under-sampling of the majority class, and ensemble learning to address noise and class imbalance. The best overall method was then combined with clustering and with estimation of a priori class probabilities from unlabeled data into a unified system for prediction on large protein databases.

Second, we studied statistical properties of protein data belonging to low-B-factor ordered regions, high-B-factor ordered regions, short intrinsically disordered regions, and long intrinsically disordered regions. We provided evidence that the four groups represent distinct types of protein flexibility, with the low-B-factor ordered regions being considerably different from the remaining three groups. Furthermore, the amino acid compositions of the low-B-factor ordered regions, high-B-factor ordered regions, short disordered regions, and long disordered regions are all distinct and do not merely differ quantitatively along a continuum. Based on these differences, a predictor of high-B-factor ordered regions was constructed.

Third, in addition to ordered and disordered regions, we studied the boundary regions between ordered and long disordered regions. We found specific amino acid signals that are characteristic of the boundary regions and subsequently built a predictor of order/disorder boundaries. This predictor was then combined with a standard order/disorder predictor into a preliminary boundary-augmented model.

Finally, we studied amino acid substitution patterns of intrinsically disordered proteins and constructed a new scoring system, i.e., a scoring matrix and gap penalties, that improves sequence alignments of intrinsically disordered proteins.
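
The classification system described in the abstract (a permutation-test feature filter followed by over-sampling, under-sampling, and ensemble learning) can be illustrated with a minimal Python sketch. The abstract does not specify the test statistic, the base learner, or any parameter values, so everything concrete here is an assumption made for illustration: the absolute difference of class means as the statistic, a nearest-centroid base learner, majority voting, and all function names and defaults are hypothetical rather than taken from the thesis.

```python
import random
from statistics import mean


def permutation_p_value(values, labels, n_perm=200, rng=None):
    """Permutation test on |mean(positive) - mean(negative)| for one feature.

    The choice of statistic is an assumption; the abstract does not name one.
    """
    rng = rng or random.Random(0)

    def mean_gap(lbls):
        pos = [v for v, y in zip(values, lbls) if y == 1]
        neg = [v for v, y in zip(values, lbls) if y == 0]
        return abs(mean(pos) - mean(neg))

    observed = mean_gap(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between feature and label
        if mean_gap(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def select_features(X, y, alpha=0.05, n_perm=200, seed=0):
    """Return indices of features whose permutation p-value is below alpha."""
    rng = random.Random(seed)
    keep = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        if permutation_p_value(column, y, n_perm, rng) < alpha:
            keep.append(j)
    return keep


def balanced_sample(X, y, size, rng):
    """Over-sample the minority class and under-sample the majority class."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    pos_draw = rng.choices(pos, k=size)               # with replacement
    neg_draw = rng.sample(neg, k=min(size, len(neg)))  # without replacement
    return pos_draw, neg_draw


def centroid(rows):
    """Per-feature mean of a list of rows."""
    return [mean(col) for col in zip(*rows)]


def train_ensemble(X, y, n_models=11, size=20, seed=0):
    """Fit one nearest-centroid model (hypothetical base learner) per balanced sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        pos, neg = balanced_sample(X, y, size, rng)
        models.append((centroid(pos), centroid(neg)))
    return models


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def predict(models, row):
    """Majority vote: 1 if most models place the row nearer the positive centroid."""
    votes = sum(
        1 for c_pos, c_neg in models if sq_dist(row, c_pos) < sq_dist(row, c_neg)
    )
    return 1 if 2 * votes > len(models) else 0
```

In use, one would call `select_features` on the training data, keep only the returned columns, pass the reduced data to `train_ensemble`, and classify new rows with `predict`. Drawing a fresh balanced sample for each ensemble member lets every model see the whole minority class over time while no single model is dominated by the majority class.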
