Benchmarking protein classification algorithms via supervised cross-validation.

Kertesz-Farkas A; Dhir S; Sonego P; Pacurar M; Netoteia S; Nijveen H; Kuzniar A; Leunissen JA; Kocsor A; Pongor S

Abstract

Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups that differ vastly in number of members, average protein size, within-group similarity, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates of how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database, has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced-size model datasets suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection, which currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with the comparison algorithms BLAST, Smith-Waterman and Needleman-Wunsch, as well as the 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic, estimates of classifier performance than do random cross-validation schemes.
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison. The datasets are available at http://hydra.icgeb.trieste.it/benchmark.
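
The idea of selecting test and train sets by known subtypes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy records, the two-level hierarchy (superfamily, family), the function names `tree_distance` and `supervised_splits`, and the rule for distributing negative subgroups are all assumptions made for illustration. The abstract does not define its graph-theoretic distance; the sketch assumes the number of edges on the path between two nodes of the classification tree.

```python
from collections import defaultdict

# Hypothetical toy records: (protein_id, path in the classification tree).
# Each path runs from just below the root to the subgroup: (superfamily, family).
PROTEINS = [
    ("p1", ("sfA", "fam1")),
    ("p2", ("sfA", "fam1")),
    ("p3", ("sfA", "fam2")),
    ("p4", ("sfA", "fam3")),
    ("p5", ("sfB", "fam4")),
    ("p6", ("sfB", "fam5")),
    ("p7", ("sfC", "fam6")),
    ("p8", ("sfC", "fam7")),
]


def tree_distance(path_a, path_b):
    """Count the edges between two tree nodes given their paths from the root.

    Assumed definition: steps up from each node to their lowest common ancestor.
    """
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def supervised_splits(proteins, positive_group):
    """Yield one (held_out, train, test) task per subgroup of the positive group.

    Items in train and test are (protein_id, label) pairs, label 1 for positives.
    Positives from the held-out subgroup go to test, all other positives to train.
    Negatives are assigned by whole subgroup, so no subgroup spans both sets.
    """
    by_subgroup = defaultdict(list)
    for pid, path in proteins:
        by_subgroup[path].append(pid)

    pos_paths = sorted(p for p in by_subgroup if p[0] == positive_group)
    neg_paths = sorted(p for p in by_subgroup if p[0] != positive_group)

    for held_out in pos_paths:
        train, test = [], []
        for path in pos_paths:
            target = test if path == held_out else train
            target.extend((pid, 1) for pid in by_subgroup[path])
        # Alternate negative subgroups between train and test: a simple,
        # deterministic stand-in for the sampling a real benchmark would use.
        for i, path in enumerate(neg_paths):
            target = test if i % 2 else train
            target.extend((pid, 0) for pid in by_subgroup[path])
        yield held_out, train, test


if __name__ == "__main__":
    for held_out, train, test in supervised_splits(PROTEINS, "sfA"):
        print(held_out, "train:", train, "test:", test)
    print(tree_distance(("sfA", "fam1"), ("sfB", "fam4")))
```

Because every test positive comes from a family the classifier never saw during training, a task built this way probes generalization to unseen subtypes, which a random k-fold split would not guarantee.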

Bibliographic details

Source
Journal of Biochemical and Biophysical Methods | 2008, Issue 6 | 9 pages
Authors
Kertesz-Farkas A; Dhir S; Sonego P; Pacurar M; Netoteia S; Nijveen H; Kuzniar A; Leunissen JA; Kocsor A; Pongor S
Original format: PDF
Language: English
Chinese Library Classification: Biochemistry; Biophysics
Keywords
Proteins; Classification of information; Algorithms; cross-validation


