Journal: Neurocomputing

The role of pertinently diversified and balanced training as well as testing data sets in achieving the true performance of classifiers in predicting the antifreeze proteins

Summary

Highlights

  • Optimal splitting criteria for creating representative training and testing sets.
  • Optimal representation of the entire input space resulted in enhanced generalization.
  • Framework for handling both within-class and between-class imbalance.
  • Efficient evaluation parameters used for comparison.

Abstract

Antifreeze proteins (AFPs) are proteins that inhibit the ice nucleation process and thereby enable certain organisms to survive in sub-zero habitats. AFPs are thought to have evolved from different protein families to perform the unique function of antifreeze activity, which makes them a classic example of convergent evolution. Common sequence similarity search methods have failed to predict putative AFPs because of the poor sequence and structural similarity among the different AFP sub-types. Machine learning techniques are a viable alternative for predicting putative AFPs. In this paper, we discuss the criteria (such as apposite feature selection, balanced data sets, and complete learning) that must be taken into account for machine learning methods to be applied successfully, and we implement these criteria with a clustering procedure in order to achieve the true performance of the learning algorithms.

Diversified and representative training and testing data sets are crucial for perfect learning as well as true testing of machine-learning-based prediction methods, for two reasons. First, a training data set that lacks a definable subset of input patterns makes prediction of patterns belonging to that subset difficult or infeasible (resulting in incomplete learning). Second, a testing data set that lacks a definable subset of input patterns cannot show whether the classifier predicts that subset correctly (resulting in incomplete testing).

Moreover, balanced training and testing data sets are equally important for achieving the true (robust) performance of classifiers, because a well-balanced training set eliminates the classifier's bias toward a particular class or sub-class caused by over-representation or under-representation of the input patterns belonging to it. We used the K-means clustering algorithm to create diversified and balanced training and testing data sets, overcoming the shortcoming of random splitting, which cannot guarantee representative training and testing sets. The clustering-based optimal splitting criteria proved better than random splitting for creating training and testing sets, in terms of superior generalization and robust evaluation.
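The idea of clustering-based splitting described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not give the number of clusters, the feature representation, or the split ratio, so the function names (`kmeans`, `cluster_split`, `split_by_class`), the parameter `k`, and the 25% test fraction are all assumptions made for illustration. The sketch clusters each class separately with a plain K-means, then draws training and testing points from every cluster, so that each definable subset of a class appears on both sides of the split.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_split(points, k, test_fraction=0.25, seed=0):
    """Split indices so every cluster with 2+ members feeds both sets."""
    rng = random.Random(seed)
    labels = kmeans(points, k, seed=seed)
    train, test = [], []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        rng.shuffle(idx)
        if len(idx) > 1:
            # at least one test point, but never the whole cluster
            n_test = min(len(idx) - 1, max(1, round(len(idx) * test_fraction)))
        else:
            n_test = 0  # singleton clusters go to training (an assumption)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test), labels

def split_by_class(X, y, k, test_fraction=0.25, seed=0):
    """Apply cluster_split within each class, mapping back to global indices."""
    train, test = [], []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        tr, te, _ = cluster_split([X[i] for i in idx], k, test_fraction, seed)
        train += [idx[i] for i in tr]
        test += [idx[i] for i in te]
    return train, test
```

Calling `split_by_class(X, y, k=2)` on a list of feature tuples `X` and class labels `y` returns two index lists. Because sampling happens per cluster rather than over the whole data set, a small sub-type cannot be left out of the training or testing set by chance, which is the failure of random splitting that the abstract points to. Balancing the number of points drawn per cluster or per class would be a further step that this sketch does not attempt.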
