Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments

Abstract

For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease–endonuclease–phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at .
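
The idea of a template that says how to align a subgroup MSA against an MSA higher up the hierarchy can be sketched with a toy data structure. The Python below is only an illustration under stated assumptions: the `Node` class, the `to_parent` column map and the example sequences are all hypothetical, and nothing here reflects the actual MAPGAPS file format or search algorithm. It shows a subgroup row being re-expressed, level by level, in the column coordinates of the root alignment, with columns that are insertions relative to the parent dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One subgroup alignment in a toy hierarchical MSA (hypothetical model)."""
    name: str
    rows: Dict[str, str]                      # sequence id -> aligned row
    parent: Optional["Node"] = None
    # For each column of this node: the matching parent column, or None
    # when the column is an insertion relative to the parent.
    to_parent: List[Optional[int]] = field(default_factory=list)

    @property
    def width(self) -> int:
        return len(next(iter(self.rows.values())))


def project_to_root(node: Node, row: str) -> str:
    """Re-express one aligned row in the column coordinates of the root."""
    while node.parent is not None:
        lifted = ["-"] * node.parent.width    # start as all gaps in the parent
        for col, residue in enumerate(row):
            target = node.to_parent[col]
            if target is not None:            # insertions are dropped
                lifted[target] = residue
        row, node = "".join(lifted), node.parent
    return row


def flatten(nodes: List[Node]) -> Dict[str, str]:
    """Merge every subgroup's rows into one MSA in root coordinates."""
    merged = {}
    for node in nodes:
        for seq_id, row in node.rows.items():
            merged[seq_id] = project_to_root(node, row)
    return merged


# Made-up example: a root alignment, a subgroup, and a sub-subgroup.
root = Node("superfamily", {"consensus": "MKVLDE"})
sub = Node("subgroup_A", {"seqA1": "MKQVL", "seqA2": "MR-VL"},
           parent=root, to_parent=[0, 1, None, 2, 3])
leaf = Node("subgroup_A1", {"seqA1a": "KVL"},
            parent=sub, to_parent=[1, 3, 4])

msa = flatten([root, sub, leaf])
for seq_id, row in msa.items():
    print(f"{seq_id:10s} {row}")
```

In this toy model every sequence ends up with the same width as the root alignment, which is what allows rows from different subgroups to be stacked into a single large MSA.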

Bibliographic details

Journal: Oxford Open
Authors: Andrew F Neuwald; Christopher J Lanczycki; Theresa K Hodges; Aron Marchler-Bauer
Year: 2020
Total pages: 8
Original format: PDF

Similar literature

1. Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments [J]. Andrew F Neuwald, Christopher J Lanczycki, Theresa K Hodges. Database. 2020, issue 1.
2. Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee [J]. Jia-Ming Chang, Paolo Di Tommaso, Jean-François Taly. BMC Bioinformatics. 2012, Supplement 4.
3. Local multiple alignment of numerical sequences: detection of subtle motifs from protein sequences and structures [C]. Tatsuya Akutsu, Katsuhisa Horimoto. Workshop on Genome Informatics. 2001.
4. New multiple sequence alignment approach reveals proteins structural determinants [D]. Baino, Khaled A. 2010.
5. PSI/TM-Coffee: a web server for fast and accurate multiple sequence alignments of regular and transmembrane proteins using homology extension on reduced databases [O]. Evan W. Floden, Paolo D. Tommaso, Maria Chatzou. 2016.
6. Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments [O]. Andrew F Neuwald, Christopher J Lanczycki, Theresa K Hodges. 2020.
