Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments

Andrew F Neuwald; Christopher J Lanczycki; Theresa K Hodges; Aron Marchler-Bauer

Abstract

For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly because of the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease–endonuclease–phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
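
The idea of template MSAs that relate each subgroup MSA to the MSAs above it can be pictured with a small toy model. The sketch below is only an illustration under stated assumptions: it is not the MAPGAPS file format or algorithm, and the names `SubgroupNode`, `to_parent`, `column_to_root` and `project_row` are hypothetical. It assumes each level stores a simple column map to its parent (with `None` marking a column that has no counterpart in the parent) and composes those maps to place a subgroup row in root coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SubgroupNode:
    """One subgroup MSA in a toy hierarchy (hypothetical structure)."""
    name: str
    parent: Optional[str]            # None for the root (superfamily) MSA
    to_parent: List[Optional[int]]   # column i here -> parent column, or None


def column_to_root(nodes: Dict[str, SubgroupNode], name: str, col: int) -> Optional[int]:
    """Compose the per-level column maps until the root is reached."""
    node = nodes[name]
    current: Optional[int] = col
    while node.parent is not None and current is not None:
        current = node.to_parent[current]
        node = nodes[node.parent]
    return current


def project_row(nodes: Dict[str, SubgroupNode], name: str, row: str, root_width: int) -> str:
    """Place an aligned subgroup row into root coordinates.

    Columns with no root counterpart are dropped; root columns that the
    subgroup does not cover are filled with gap characters.
    """
    out = ["-"] * root_width
    for col, residue in enumerate(row):
        root_col = column_to_root(nodes, name, col)
        if root_col is not None:
            out[root_col] = residue
    return "".join(out)


# Toy three-level hierarchy: root <- family <- subfamily (made-up data).
nodes = {
    "root": SubgroupNode("root", None, []),
    "family": SubgroupNode("family", "root", [0, 1, None, 2, 3]),
    "subfamily": SubgroupNode("subfamily", "family", [0, 2, 3, 4]),
}

projected = project_row(nodes, "subfamily", "ACDE", root_width=5)
print(projected)
```

In this toy model the second subfamily column maps to a family column that is an insertion relative to the root, so its residue is dropped when the row is projected. A real hiMSA carries far richer information (curated template alignments rather than bare index lists), so this should be read only as a mental model of how alignments at different levels can be related.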
