Journal: Knowledge-Based Systems

Online feature selection for high-dimensional class-imbalanced data


Abstract

When tackling high dimensionality in data mining, online feature selection, which handles features that arrive one by one over time, offers advantages over traditional feature selection methods. However, in real-world applications such as fraud detection and medical diagnosis, the data are both high dimensional and highly class imbalanced; that is, some classes have many more instances than others. Under such class imbalance, existing online feature selection algorithms usually ignore the small classes, which can be important in these applications. Learning from high-dimensional, class-imbalanced data in an online manner is therefore a challenge. Motivated by this, we first formalize the problem of online streaming feature selection for class-imbalanced data, and then present an efficient online feature selection framework based on the dependency between condition features and decision classes. We also propose a new algorithm, Online Feature Selection based on the Dependency in K nearest neighbors, called K-OFSD. Drawing on Neighborhood Rough Set theory, K-OFSD uses the information of nearest neighbors to select relevant features that yield higher separability between the majority class and the minority class. Finally, experimental studies on seven high-dimensional, class-imbalanced data sets show that our algorithm achieves better performance than traditional feature selection methods using the same number of features, and than state-of-the-art online streaming feature selection algorithms in an online manner. (C) 2017 Elsevier B.V. All rights reserved.
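
The abstract does not give the dependency formula or the selection rule, so the following is only a rough sketch of the general idea, not K-OFSD itself. It assumes a simple stand-in measure (the mean fraction of an instance's k nearest neighbors that share its class label) and a hypothetical acceptance rule (keep an arriving feature only if it strictly raises that measure). The names `knn_dependency` and `select_streaming`, and the toy data, are invented for illustration.

```python
import numpy as np


def knn_dependency(X, y, k=1):
    """Mean fraction of each instance's k nearest neighbors sharing its label.

    Assumption: a simplified stand-in for a neighborhood dependency
    measure, not the definition used in the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if X.ndim == 1:
        X = X[:, None]
    # Pairwise squared Euclidean distances in the candidate feature subspace.
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    same_label = y[nearest] == y[:, None]
    return float(same_label.mean())


def select_streaming(feature_stream, y, k=1):
    """Process features one at a time; keep one only if dependency rises.

    Assumption: this greedy acceptance rule is hypothetical.
    """
    selected, columns, best = [], [], 0.0
    for name, column in feature_stream:
        column = np.asarray(column, dtype=float)
        candidate = np.column_stack(columns + [column])
        score = knn_dependency(candidate, y, k)
        if score > best:
            selected.append(name)
            columns.append(column)
            best = score
    return selected, best


# Toy imbalanced data: 8 majority instances (label 0), 2 minority (label 1).
y = np.array([0] * 8 + [1] * 2)
signal = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5.0, 5.1]
noise = [0.3, 0.9, 0.1, 0.7, 0.5, 0.2, 0.8, 0.4, 0.6, 0.0]
stream = [("signal", signal), ("noise", noise), ("redundant", signal)]

selected, best = select_streaming(stream, y, k=1)
print(selected, best)  # ['signal'] 1.0
```

With k=1 the two minority instances are each other's nearest neighbor under `signal`, so the measure already reaches 1.0 and the later features cannot improve it. A larger k than the minority class size would cap the minority instances' contribution, which hints at why the choice of neighborhood matters under class imbalance.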
