Conference: International Conference on Pattern Recognition and Machine Intelligence

Instance Ranking Using Data Complexity Measures for Training Set Selection



Abstract

A classifier's performance depends on the training set it is given, so training set selection holds an important place in the classification task: it can improve the classifier's performance and reduce the time taken for training. Selection can be done with various approaches, such as algorithmic methods, data-handling techniques, cost-sensitive methods and ensembles. In this work, one of the data complexity measures, the maximum Fisher's discriminant ratio (F1), is used to identify good training instances. This measure discriminates between two classes on a specific feature by comparing the class means and variances, and in particular it indicates the overlap between the classes. In the first phase, F1 is calculated for the whole data set. Next, F1 is computed using the leave-one-out method to rank each instance. Finally, the instances that lower the F1 value are all removed from the data set as a batch. A small F1 value represents strong overlap between the classes, so removing the instances that cause more overlap reduces the overlap further. The efficacy of the proposed reduction algorithm (DRF1) is demonstrated empirically using 4 classifiers (Random Forest, Decision Tree-C5.0, SVM and kNN) and 6 data sets (Pima, Musk, Sonar, Winequality (R and W) and Wisconsin). The results confirm that DRF1, as a training set selection method based on a data complexity measure, leads to a promising improvement in kappa statistics and classification accuracy. A reduction of approximately 18-50% is achieved, and training time is also greatly reduced.
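
The three phases described in the abstract can be sketched in Python with NumPy. This is only an illustrative reading of the abstract, not the authors' implementation. It makes several assumptions: the function names `fisher_f1` and `drf1_select` are hypothetical; F1 is taken as the maximum over features of (mean difference)^2 / (sum of class variances) for a two-class problem; population variance is used; features with zero pooled variance are scored 0; and an instance is treated as "lowering F1" when leaving it out raises F1 above the whole-data value.

```python
import numpy as np


def fisher_f1(X, y):
    """Maximum Fisher's discriminant ratio over all features (two classes)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = X[y == classes[0]], X[y == classes[1]]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)
    # Assumption: a feature with zero pooled variance contributes a ratio of 0.
    ratios = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return ratios.max()


def drf1_select(X, y):
    """Return (keep_mask, leave_one_out_f1, base_f1) for a two-class data set."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    base = fisher_f1(X, y)              # phase 1: F1 of the whole data set
    n = len(y)
    loo = np.empty(n)
    mask = np.ones(n, dtype=bool)
    for i in range(n):                  # phase 2: F1 with instance i left out
        mask[i] = False
        loo[i] = fisher_f1(X[mask], y[mask])
        mask[i] = True
    # Phase 3 (assumed reading): if F1 rises when i is left out, i was lowering
    # F1, so it is dropped; all such instances are removed together as a batch.
    keep = loo <= base
    return keep, loo, base


# Toy data: the last class-0 point (5.0) sits inside class 1.
X = np.array([[0.0], [0.1], [-0.1], [0.05], [5.0], [5.0], [5.1], [4.9], [5.05]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
keep, loo, base = drf1_select(X, y)
X_reduced, y_reduced = X[keep], y[keep]
```

The leave-one-out values in `loo` also give a ranking: the larger the value, the more the corresponding instance was contributing to class overlap under this reading.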
