...
Journal: BMC Bioinformatics

Machine learning methods can replace 3D profile method in classification of amyloidogenic hexapeptides

Abstract

Background Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, such as Alzheimer's disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of amino acids, which transform the structure when exposed. A few hundred such peptides have been found experimentally. Experimental testing of all possible amino acid combinations is currently not feasible. Instead, amyloidogenic segments can be predicted by computational methods. The 3D profile is a physicochemical method that has generated the largest dataset, ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods.

Results We generated a new dataset of hexapeptides using a more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, of which 204 are classified as amyloidogenic. The dataset of 6-residue sequences, with a binary classification based on the energy of each segment, was used to train machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both achieved an area under the ROC curve of 0.96, an accuracy of 91%, a true positive rate of ca. 78%, and a true negative rate of 95%. A few other machine learning methods also performed well. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning).
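The performance figures quoted in the Results (area under the ROC curve, accuracy, true positive rate, true negative rate) all come from comparing predicted and reference labels on a test set. The following is a minimal Python sketch of how such figures can be computed. The function names and the toy labels and scores are illustrative assumptions, not the authors' code or data.

```python
# Illustrative sketch only: toy data, hypothetical function names.
# Label 1 = amyloidogenic, label 0 = non-amyloidogenic.

def confusion_counts(labels, predictions):
    """Return (tp, fn, tn, fp) for binary labels and binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return tp, fn, tn, fp


def summary_rates(labels, predictions):
    """Accuracy, true positive rate and true negative rate."""
    tp, fn, tn, fp = confusion_counts(labels, predictions)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "tpr": tp / (tp + fn),  # fraction of positives recovered
        "tnr": tn / (tn + fp),  # fraction of negatives recovered
    }


def roc_auc(labels, scores):
    """Area under the ROC curve as the probability that a random positive
    scores higher than a random negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 3 positives and 5 negatives with classifier scores.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
toy_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]

toy_rates = summary_rates(toy_labels, toy_predictions)
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_rates, toy_auc)
```

Accuracy is a prevalence-weighted mix of the two rates, so on a dataset dominated by negatives it sits closer to the true negative rate. This is why the true positive rate is worth reporting separately.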
Conclusions We showed that the simplified profile generation method does not introduce an error relative to the original method, while increasing computational efficiency. Our new dataset proved representative enough to allow simple statistical methods to test amyloidogenicity based only on six-letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy-based classifier, with the advantages of greatly reduced computational time and a simpler analysis. Additionally, a decision tree provides a set of very easily interpretable rules.
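A classifier that works "based only on six-letter sequences" needs each hexapeptide turned into a numeric feature vector. The abstract does not say which encoding the authors used. The sketch below assumes a plain one-hot encoding over the 20 standard amino acids, and its names (`AMINO_ACIDS`, `one_hot_hexapeptide`) and example sequence are hypothetical. It also shows the size of the full hexapeptide space, which illustrates why exhaustive experimental testing is not feasible.

```python
# Illustrative sketch only: one-hot encoding is an assumed choice,
# not necessarily the feature representation used in the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
PEPTIDE_LENGTH = 6


def one_hot_hexapeptide(sequence):
    """Encode a 6-residue sequence as a flat 0/1 vector of length 120.

    Position k of the peptide occupies slots [20*k, 20*k + 20) and
    exactly one slot in each block is set to 1.
    """
    sequence = sequence.upper()
    if len(sequence) != PEPTIDE_LENGTH:
        raise ValueError("expected exactly 6 residues")
    vector = [0] * (PEPTIDE_LENGTH * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence):
        if residue not in AA_INDEX:
            raise ValueError(f"unknown residue: {residue!r}")
        vector[position * len(AMINO_ACIDS) + AA_INDEX[residue]] = 1
    return vector


# Number of distinct hexapeptides over 20 amino acids: 20**6 = 64,000,000.
SEQUENCE_SPACE = len(AMINO_ACIDS) ** PEPTIDE_LENGTH

# Arbitrary example sequence, chosen only to exercise the encoder.
example_vector = one_hot_hexapeptide("ACDEFG")
print(len(example_vector), sum(example_vector), SEQUENCE_SPACE)
```

Vectors of this form can be passed to any standard classifier together with the binary energy-based labels. Other encodings, such as physicochemical descriptors per residue, would fit the same interface.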
