
Pattern-directed aligned pattern clustering


Abstract

Functional region identification is of fundamental importance for the sequence analysis of a protein family. Such knowledge not only provides a better scientific understanding but also assists drug discovery. Domain annotation is one approach, but it relies on existing databases. For de novo discovery, motif discovery locates and aligns locally similar sub-sequences and represents them as a position-weight matrix (PWM). However, a PWM is a fixed-length model, whereas the size of protein functional regions varies. Furthermore, to obtain a PWM, a width range parameter needs to be identified through exhaustive search, which is computationally intensive for large datasets. This paper presents a new method known as Pattern-Directed Aligned Pattern Clustering (PD-APCn) to discover and align residues in conserved protein functional regions. It adopts the Aligned Pattern Cluster (APC) as its representation model, which allows variable pattern length. It uses patterns with strong support to direct the incremental expansion of the APCs, allowing substitution and frame-shift mutations, until a robust termination condition is reached. The concept of a breakpoint gap is introduced to identify uncovered conserved patterns with substitution and frame-shift mutations, which are often rare mutants. To evaluate the performance of PD-APCn, we conducted experiments on synthetic datasets of different sizes and noise levels. Compared with the popular motif discovery algorithm MEME, PD-APCn demonstrated competitive performance throughout the experiments, obtaining a higher recall and F-measure with a computational speed-up of up to 400×.
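To illustrate the fixed-length limitation of a PWM mentioned in the abstract, here is a minimal Python sketch of how a PWM might be built from aligned sub-sequences. It is not the paper's method and not MEME's implementation. The function name `build_pwm`, the uniform background distribution, the pseudocount of 1.0, and the example segments are all assumptions made for illustration.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(segments, pseudocount=1.0):
    """Return one dict of log-odds weights per column of the aligned segments.

    Hypothetical helper: assumes a uniform background distribution and
    additive smoothing; real motif finders make different choices.
    """
    width = len(segments[0])
    # A PWM has one column per position, so every segment must share one width.
    if any(len(s) != width for s in segments):
        raise ValueError("a PWM requires segments of a single fixed width")

    background = 1.0 / len(AMINO_ACIDS)
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for col in range(width):
        counts = Counter(s[col] for s in segments)
        pwm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


# Made-up aligned segments of width 4.
pwm = build_pwm(["HCGS", "HCGT", "HCAS"])
print(len(pwm))               # number of columns equals the segment width
print(round(pwm[0]["H"], 3))  # a conserved residue gets a positive weight
```

Because the matrix has exactly one column per position, segments of different lengths are rejected outright. This is the constraint that a variable-length representation such as the APC is described as avoiding.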
