Conference: International Conference on Pattern Recognition and Machine Intelligence

Instance Ranking Using Data Complexity Measures for Training Set Selection


Abstract

A classifier's performance depends on the training set it is given, so training set selection holds an important place in the classification task: it can both improve the classifier's performance and reduce the time taken for training. Selection can be done using various methods, such as algorithmic approaches, data-handling techniques, cost-sensitive methods, and ensembles. In this work, one of the data complexity measures, the maximum Fisher's discriminant ratio (F1), is used to identify good training instances. For a specific feature, this measure discriminates between any two classes by comparing the class means and variances; in particular, it indicates the overlap between the classes. In the first phase, F1 of the whole data set is calculated. Next, F1 is computed using the leave-one-out method to rank each instance. Finally, the instances that lower the F1 value are removed from the data set as a batch. A small F1 value represents strong overlap between the classes, so removing the instances that cause more overlap reduces the overlap further. The efficacy of the proposed reduction algorithm (DRF1) is demonstrated empirically using 4 different classifiers (Random Forest, Decision Tree-C5.0, SVM, and kNN) and 6 data sets (Pima, Musk, Sonar, Winequality (R and W), and Wisconsin). The results confirm that training set selection with DRF1, based on a data complexity measure, leads to a promising improvement in kappa statistics and classification accuracy. A reduction of approximately 18-50% is achieved, and training time is also greatly reduced.
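The three phases described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' DRF1 implementation: the function names `fisher_f1` and `select_training_set` are hypothetical, the per-feature ratio is assumed to be the usual form (squared difference of class means divided by the sum of class variances), population variance is assumed, and only the two-class case is handled.

```python
import numpy as np


def fisher_f1(X, y):
    """Maximum Fisher's discriminant ratio over all features (two classes).

    Assumed per-feature form: (mean_a - mean_b)^2 / (var_a + var_b),
    with population variance (ddof=0). Both are assumptions of this sketch.
    """
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = X[y == classes[0]], X[y == classes[1]]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)
    # Features with zero total variance get a ratio of 0 (sketch assumption).
    ratios = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(ratios.max())


def select_training_set(X, y):
    """Rank instances by leave-one-out F1 and drop those that lower F1.

    Returns (keep_mask, loo_f1, full_f1). An instance is treated as
    "lowering F1" when F1 computed without it exceeds F1 of the full set.
    """
    full_f1 = fisher_f1(X, y)                # phase 1: F1 of the whole set
    n = X.shape[0]
    loo_f1 = np.empty(n)
    for i in range(n):                       # phase 2: leave-one-out F1
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        loo_f1[i] = fisher_f1(X[mask], y[mask])
    keep_mask = loo_f1 <= full_f1            # phase 3: batch removal
    return keep_mask, loo_f1, full_f1


# Toy data: one class-0 point (5.0) sits inside class 1 and causes overlap.
X = np.array([[0.0], [0.1], [-0.1], [0.05], [5.0],
              [5.0], [5.1], [4.9], [5.05]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
keep, loo, full = select_training_set(X, y)
X_reduced, y_reduced = X[keep], y[keep]
```

On this toy data the overlapping class-0 point at index 4 is among the removed instances, since F1 rises sharply once it is left out. The loop recomputes F1 from scratch for every instance, which is simple but quadratic in the number of instances; an incremental update of the class means and variances would be the natural optimization.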
