Conference: International Conference on Intelligent Systems for Molecular Biology

Detecting protein sequence conservation via metric embeddings

Abstract

Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.

Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. the number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.

Availability: The source code is available at http://www.eecs.berkeley.edu/-eran/projects/embed.
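The pipeline described in the abstract (score matrix, then peptide distance, then bit vectors, then Hamming-space search) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the four-letter score table holds toy values rather than a real BLOSUM matrix, the distance d(a, b) = s(a, a) + s(b, b) - 2 s(a, b) is one common way to turn scores into distances and may differ from the authors' function, the 12-bit codebook `CODES` was built by hand so that Hamming distance matches the toy distance exactly (the paper only claims a close approximation), and `lsh_candidate_pairs` is plain bit-sampling locality-sensitive hashing, not LSH-ALL-PAIRS-SIM itself.

```python
import random
from collections import defaultdict
from itertools import combinations

# Toy substitution scores (illustrative values only, not a real BLOSUM matrix).
SCORES = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("A", "I"): -1, ("A", "L"): -1, ("A", "V"): 0,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
}

# Hand-built, hypothetical codebook: the Hamming distance between two codes
# equals residue_distance() for this toy table.
CODES = {
    "V": "000000000000",
    "I": "110000000000",
    "L": "111111000000",
    "A": "001100111111",
}


def score(a, b):
    """Symmetric lookup in the toy score table."""
    return SCORES.get((a, b), SCORES.get((b, a)))


def residue_distance(a, b):
    """Assumed distance: d(a, b) = s(a, a) + s(b, b) - 2 * s(a, b)."""
    return score(a, a) + score(b, b) - 2 * score(a, b)


def peptide_distance(p, q):
    """Sum of per-position residue distances for equal-length peptides."""
    return sum(residue_distance(a, b) for a, b in zip(p, q))


def embed(peptide):
    """Concatenate the per-residue codes into one bit string."""
    return "".join(CODES[r] for r in peptide)


def hamming(u, v):
    """Number of mismatching positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(u, v))


def lsh_candidate_pairs(peptides, k=8, rounds=10, seed=0):
    """Bit-sampling LSH: peptides that agree on k random bits become candidates.

    Assumes a non-empty collection of equal-length peptides and k no larger
    than the embedded length.
    """
    rng = random.Random(seed)
    vectors = {p: embed(p) for p in peptides}
    n_bits = len(next(iter(vectors.values())))
    candidates = set()
    for _ in range(rounds):
        positions = rng.sample(range(n_bits), k)
        buckets = defaultdict(list)
        for p, v in vectors.items():
            buckets["".join(v[i] for i in positions)].append(p)
        for group in buckets.values():
            candidates.update(combinations(sorted(group), 2))
    return candidates
```

In this sketch, close peptides share most bits and therefore collide in some round with high probability, while distant peptides rarely do. Candidate pairs would still need to be verified with a real alignment score, and `k` and `rounds` trade sensitivity against running time.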
