Journal: Data & Knowledge Engineering

DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets


Abstract

Imbalanced data sets are pervasive in real-world practice and have therefore become a research area of considerable interest in the machine learning community. Imbalanced data sets significantly reduce the performance of standard classifiers when these are used to learn the concepts underlying the data. The problem becomes even more severe when the imbalanced data sets are also high dimensional. This paper presents a novel feature ranking approach based on probability density estimation to cope with these issues. The idea behind our approach, named Density Based Feature Selection (DBFS), is that the distributions of features over classes can bring significant benefits to feature selection algorithms. In other words, to assess the contribution of each attribute and assign it an appropriate rank, DBFS takes into account the features' corresponding distributions over all classes along with their correlations. To show the effectiveness of the presented approach, well-known feature ranking methods are implemented and compared with our approach across a variety of small sample size and high dimensional data sets from the microarray, mass spectrometry and text mining domains. Our theoretical analysis and experimental observations reveal that our approach is the method of choice, offering a simple yet effective feature ranking method based on well-known statistical evaluation measures.
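
The abstract does not state the DBFS scoring formula, so the following is only a minimal sketch of the general idea it describes: estimate each feature's density separately per class, then rank features by how little those class-conditional densities overlap. The function names (gaussian_kde, separation_score, rank_features), the rule-of-thumb bandwidth, the overlap-based score and the toy data are all assumptions made for illustration and are not taken from the paper. The sketch handles binary labels only and leaves out the feature correlations that the abstract also mentions.

```python
import math
import statistics


def gaussian_kde(samples):
    """Return a 1-D Gaussian kernel density estimate of `samples` as a function."""
    n = len(samples)
    spread = statistics.stdev(samples) if n > 1 else 0.0
    # Normal-reference rule of thumb (an assumption); fall back to 1.0 if spread is 0.
    bandwidth = 1.06 * spread * n ** -0.2
    if bandwidth == 0.0:
        bandwidth = 1.0
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def separation_score(values, labels, grid_size=400):
    """Hypothetical score: 1 minus the overlap area of the two class densities."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles binary labels only")
    # One density per class, so the minority class is not swamped by the majority.
    densities = [
        gaussian_kde([v for v, lab in zip(values, labels) if lab == c])
        for c in classes
    ]
    lo, hi = min(values), max(values)
    pad = 0.25 * (hi - lo) or 1.0
    step = (hi - lo + 2 * pad) / grid_size
    overlap = 0.0
    for i in range(grid_size):
        # Midpoint Riemann sum of min(p0(x), p1(x)) over the padded value range.
        x = lo - pad + (i + 0.5) * step
        overlap += min(densities[0](x), densities[1](x)) * step
    return 1.0 - overlap


def rank_features(rows, labels):
    """Return (feature indices sorted from best to worst, per-feature scores)."""
    n_features = len(rows[0])
    scores = [
        separation_score([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy imbalanced data: 8 majority samples (label 0), 3 minority samples (label 1).
# Feature 0 separates the classes; feature 1 takes similar values in both.
rows = [
    [0.1, 1.0], [-0.2, 2.0], [0.3, 3.0], [0.0, 4.0],
    [-0.1, 5.0], [0.2, 1.5], [-0.3, 2.5], [0.15, 3.5],
    [5.0, 2.0], [5.2, 3.0], [4.9, 4.0],
]
labels = [0] * 8 + [1] * 3

order, scores = rank_features(rows, labels)
print(order, [round(s, 3) for s in scores])
```

Because each class gets its own density estimate, a small minority class contributes as much to the score as the majority class, which is one reason a density-based view can suit imbalanced data. A feature whose class densities barely overlap scores close to 1, and a feature whose class densities nearly coincide scores close to 0.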
