Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments

Abstract

For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease–endonuclease–phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at .
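
The hierarchical arrangement described in the abstract can be pictured with a small toy model. The sketch below is only an illustration under stated assumptions: it is not the MAPGAPS file format or algorithm, and the `Node` class, the `to_parent` column map and the `project_to_root` function are hypothetical names introduced here. It shows the general idea that a sequence aligned to a subgroup MSA can be re-expressed in the coordinates of the top-level MSA by following the column correspondences that link each level to the one above it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """One subgroup MSA in a toy hierarchy (hypothetical structure, not MAPGAPS's)."""
    name: str
    width: int                      # number of aligned columns in this node's MSA
    parent: Optional[str] = None    # name of the MSA one level up; None for the root
    # to_parent[i] is the parent column that column i aligns with,
    # or None when column i is an insertion relative to the parent.
    to_parent: Optional[List[Optional[int]]] = None


def project_to_root(nodes: Dict[str, Node], start: str, row: str) -> str:
    """Re-express a row aligned to nodes[start] in root-MSA coordinates."""
    node = nodes[start]
    if len(row) != node.width:
        raise ValueError("row length must equal the width of its subgroup MSA")
    while node.parent is not None:
        parent = nodes[node.parent]
        lifted = ["-"] * parent.width          # start from an all-gap parent row
        for i, ch in enumerate(row):
            j = node.to_parent[i]
            if j is not None:                  # this toy drops insertion columns
                lifted[j] = ch
        row = "".join(lifted)
        node = parent
    return row


# A three-level toy hierarchy: subfamily -> family -> root.
nodes = {
    "root": Node("root", width=6),
    "family": Node("family", width=5, parent="root",
                   to_parent=[0, 1, None, 3, 4]),
    "subfamily": Node("subfamily", width=4, parent="family",
                      to_parent=[0, 2, 3, 4]),
}

print(project_to_root(nodes, "subfamily", "MKVL"))
```

In this toy, residues in columns that have no counterpart one level up are simply discarded; a real tool would need a policy for reporting such insertions rather than losing them.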
