
An Embedded Feature Selection Method for Imbalanced Data Classification



Abstract

Imbalanced data is a type of dataset frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy of identifying the minority class is a critically important issue. Feature selection is one method to address it: an effective feature selection method chooses a subset of features that favors accurate determination of the minority class. A decision tree is a classifier that can be built using different splitting criteria, and its advantage is the ease of detecting which feature is used as a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, an embedded feature selection method based on our proposed weighted Gini index (WGI) is presented. Comparison with the Chi2, F-statistic, and Gini index feature selection methods shows that F-statistic and Chi2 achieve the best performance when only a few features are selected; as the number of selected features increases, the proposed method has the highest probability of achieving the best performance. The area under the receiver operating characteristic curve (ROC AUC) and the F-measure are used as evaluation criteria. Experimental results on two datasets show that ROC AUC performance can be high even if only a few features are selected and used, and it changes only slightly as more and more features are selected. However, the F-measure reaches excellent performance only when 20% or more of the features are chosen. These results help practitioners select a proper feature selection method when facing a practical problem.
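The core idea described above, namely scoring each feature by how much a single decision-tree split on it reduces a class-weighted Gini impurity, can be sketched as follows. This is an illustrative re-implementation under our own assumptions, not the paper's exact WGI definition: the function names, the dictionary-based class weights, and the simple stump-based scoring are ours.

```python
import numpy as np

def weighted_gini(y, class_weight=None):
    """Gini impurity with optional per-class weights.

    Plain Gini impurity uses class frequencies p_k; here each class
    count is multiplied by a weight, so up-weighting the minority
    class makes it contribute more to the impurity (a sketch of the
    weighted-Gini idea, not the paper's exact WGI formula).
    """
    classes, counts = np.unique(y, return_counts=True)
    if class_weight is None:
        class_weight = {c: 1.0 for c in classes}
    w = np.array([class_weight[c] * n for c, n in zip(classes, counts)], float)
    p = w / w.sum()
    return 1.0 - np.sum(p ** 2)

def rank_features_by_split_gain(X, y, class_weight=None):
    """Rank features by the best single-threshold impurity reduction,
    mimicking how a decision-tree stump would score each feature."""
    n, d = X.shape
    parent = weighted_gini(y, class_weight)
    gains = np.zeros(d)
    for j in range(d):
        best = 0.0
        for t in np.unique(X[:, j])[:-1]:  # candidate split thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            child = (len(left) * weighted_gini(left, class_weight)
                     + len(right) * weighted_gini(right, class_weight)) / n
            best = max(best, parent - child)
        gains[j] = best
    return np.argsort(gains)[::-1]  # feature indices, highest gain first
```

Selecting the top-k indices returned by `rank_features_by_split_gain` then yields the embedded feature subset; filter scores such as Chi2 or the F-statistic (e.g., `sklearn.feature_selection.chi2` / `f_classif`) can be ranked and compared in the same way.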

Bibliographic Information

  • Source
    《自动化学报(英文版)》 (IEEE/CAA Journal of Automatica Sinica) | 2019, Issue 3 | pp. 703-715 | 13 pages
  • Author Affiliations

    Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA;

    Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA;

    Institute of Systems Engineering, Macau University of Science and Technology, Macau 999078, China;

    Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA;

  • Indexed in: Chinese Science Citation Database (CSCD)
  • Format: PDF
  • Language: English (eng)
  • CLC Classification
  • Keywords
