Journal: Physical Review E

Influence of multiple-sequence-alignment depth on Potts statistical models of protein covariation



Abstract

Potts statistical models have become a popular and promising way to analyze mutational covariation in protein multiple sequence alignments (MSAs) in order to understand protein structure, function, and fitness. But the statistical limitations of these models, which can have millions of parameters and are fit to MSAs of only thousands or hundreds of effective sequences using a procedure known as inverse Ising inference, are incompletely understood. In this work we predict how model quality degrades as a function of the number of sequences N, sequence length L, amino-acid alphabet size q, and the degree of conservation of the MSA, in different applications of the Potts models: in "fitness" predictions of individual protein sequences, in predictions of the effects of single-point mutations, in "double mutant cycle" predictions of epistasis, and in 3D contact prediction in protein structure. We show how, as MSA depth N decreases, an "overfitting" effect occurs such that sequences in the training MSA have overestimated fitness, and we predict the magnitude of this effect and discuss how regularization can help correct for it, using a regularization procedure motivated by statistical analysis of the effects of finite sampling. We find that as N decreases the quality of point-mutation effect predictions degrades least, fitness and epistasis predictions degrade more rapidly, and contact predictions are most affected. However, overfitting becomes negligible for MSA depths of more than a few thousand effective sequences, as often used in practice, and regularization becomes less necessary. We discuss the implications of these results for users of Potts covariation analysis.
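
To give a sense of the "millions of parameters" scale mentioned in the abstract, the short Python sketch below counts the parameters of a pairwise Potts model with one field per site and residue and one coupling per site pair and residue pair. The function name `potts_parameter_counts` and the example values L = 200 and q = 21 are illustrative assumptions and are not taken from the paper; the "independent" count assumes the usual gauge redundancy of the Potts model has been removed.

```python
from typing import Tuple


def potts_parameter_counts(L: int, q: int) -> Tuple[int, int]:
    """Return (total, independent) parameter counts for a pairwise Potts model.

    total:       L*q fields plus q*q couplings for each of the L*(L-1)/2 site pairs.
    independent: the same count after removing gauge redundancy, i.e.
                 L*(q-1) fields plus (q-1)^2 couplings per site pair.
    """
    pairs = L * (L - 1) // 2
    total = pairs * q * q + L * q
    independent = pairs * (q - 1) ** 2 + L * (q - 1)
    return total, independent


# Illustrative values only (assumed, not from the paper):
# a 200-residue alignment with 20 amino acids plus a gap character.
total, independent = potts_parameter_counts(L=200, q=21)
print(total, independent)  # 8780100 7964000
```

Under these assumed values the model has roughly eight million parameters, far more than the hundreds or thousands of effective sequences the abstract describes as typical MSA depths.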
