Discovering Protein Functional Regions and Protein-Protein Interaction using Co-occurring Aligned Pattern Clusters

Abstract

Bioinformatics is a rapidly expanding field of research owing to several recent advances: 1) the advent of machine intelligence, 2) the increase in computing power, 3) a better understanding of the underlying biomolecular mechanisms, and 4) the drastic reduction in the cost and time of biosequencing. Since wet-laboratory approaches to analyzing protein sequences are still labour intensive and time consuming, more cost-effective computational approaches for analyzing protein sequences and their biochemical interactions are crucial. This is especially true when we encounter a large collection of protein sequences. Aligned Pattern CLustering (APCL), an algorithm formulated by my research group and myself that combines machine intelligence methodologies such as pattern recognition, pattern discovery, pattern clustering and alignment, is one such technique. APCL discovers, prunes, and clusters aligned, statistically significant patterns to assemble a related, or more specifically a homologous, group of patterns in the form of an Aligned Pattern Cluster (APC). In many of our empirical experiments, the APCs obtained were found to correspond to statistically and functionally significant association patterns, which in turn correspond to conserved regions such as binding segments within and between protein sequences, as well as between protein Transcription Factors (TF) and DNA Transcription Factor Binding Sites (TFBS). While several known algorithms also find functionally conserved segments in biosequences, they are less flexible and require more parameters than APCL. Hence, APCL is a powerful tool for analyzing biosequences. Because of its effectiveness, the use of APCL is extended here from assisting in the discovery and analysis of functional regions of protein sequences to exploring, from the discovered APCs, the co-occurrence of patterns on the same sequences and the interaction of patterns between sequences.
Two new algorithms are introduced and reported in this thesis, exploring 1) APCs containing patterns residing within the same biosequences and 2) APCs containing patterns residing on interacting biosequences. The first algorithm clusters APCs that share patterns on the same biosequences. It uses a co-occurrence score between the two APCs of a co-occurrence APC pair (two APCs containing co-occurring patterns) to account for the proportion of biosequences on which their patterns co-occur relative to the total number of sequences containing them. Using this score as a similarity measure (or, more precisely, a co-occurrence measure), we devise a Co-occurrence APC Clustering Algorithm that clusters the APCs obtained from a collection of related biosequences into a Co-occurrence Cluster of APCs, abbreviated cAPC. Each cAPC is then analyzed to verify whether essential biological functions are associated with the APCs within that cluster. The cytochrome c and ubiquitin families were analyzed in depth, and it was validated that members of the same cAPC do cover functional regions that have essential cooperative biological functions.
The second algorithm takes advantage of the effectiveness of APCL to create a protein-protein interaction (PPI) identification and prediction algorithm. PPI prediction is an active research problem in bioinformatics and proteomics, and a good number of algorithms exist. The state-of-the-art algorithms achieve a high success rate in prediction but provide results that are difficult to interpret. The research in this thesis tries to overcome this hurdle. The second algorithm uses an APC-PPI score between two APCs to account for the proportion of patterns residing on two different protein sequences. This score measures how often patterns in both APCs co-occur in the sequence data of two known interacting proteins. The scores are then used to construct feature vectors, first to train a learning model from known PPI data and later to predict possible PPI between a protein pair. The algorithm's performance was comparable to that of the state-of-the-art algorithms, while its results are interpretable.
The results from both algorithms, which extend APCL to find co-occurring patterns via the co-occurrence of APCs, proved to be effective and useful, since APCL itself finds APCs quickly and effectively. The first algorithm discovered biological insights, supported by the biological literature, that typically cannot be discovered solely through the analysis of biosequences. The second algorithm succeeded in providing accurate and descriptive PPI predictions. Hence, these two algorithms are useful in the analysis and prediction of proteins. In addition, with continued research and development, the second algorithm will be a powerful tool for the drug industry, as it can help find new PPIs, an important step in developing new drugs for different drug targets.
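
The abstract describes the co-occurrence score only in words, so the following is a minimal sketch rather than the thesis's actual method. It assumes the score is a Jaccard-style ratio (sequences containing patterns from both APCs, divided by sequences containing patterns from either APC) and groups APCs with a simple threshold-based single-linkage step. The function names `cooccurrence_score` and `cluster_apcs`, the `threshold` parameter, and the toy sequence identifiers are all hypothetical.

```python
from itertools import combinations


def cooccurrence_score(seqs_a, seqs_b):
    """Assumed Jaccard-style score: shared sequences / all sequences
    containing patterns from either APC. Not the thesis's exact formula."""
    a, b = set(seqs_a), set(seqs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_apcs(apc_to_seqs, threshold=0.5):
    """Group APCs whose pairwise score meets the (hypothetical) threshold.

    Uses single linkage via union-find as an illustrative stand-in for the
    Co-occurrence APC Clustering Algorithm.
    """
    parent = {name: name for name in apc_to_seqs}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(apc_to_seqs, 2):
        if cooccurrence_score(apc_to_seqs[a], apc_to_seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in apc_to_seqs:
        clusters.setdefault(find(name), set()).add(name)
    # Sort clusters by their member names for a deterministic result.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Toy input: each APC maps to the IDs of sequences containing its patterns.
apcs = {
    "APC1": {"s1", "s2", "s3"},
    "APC2": {"s2", "s3", "s4"},
    "APC3": {"s9"},
}
groups = cluster_apcs(apcs, threshold=0.5)
```

In this toy input, `APC1` and `APC2` share two of the four sequences that contain either of them, so they fall into one group at a threshold of 0.5, while `APC3` stays on its own. An APC-PPI score could be sketched along the same lines by counting known interacting protein pairs instead of single sequences, but the abstract does not give enough detail to fix its form.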

Bibliographic details

  • Author

    Fung Sanderz

  • Author affiliation
  • Year 2015
  • Total pages
  • Original format PDF
  • Language en
  • CLC classification
