Knowledge-Based Systems

An information theoretic approach to quantify the stability of feature selection and ranking algorithms



Abstract

Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from the data by selecting the most relevant features and discarding the noisy, redundant and irrelevant ones. A problem that arises in many practical applications is that the outcome of the feature selection algorithm is not stable: small variations in the data may yield very different feature rankings. Assessing the stability of these methods therefore becomes an important issue in such situations. We propose an information-theoretic approach based on the Jensen-Shannon divergence to quantify this robustness. Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, feature subsets, as well as the lesser-studied partial ranked lists. This generalized metric quantifies the difference among a whole set of lists of the same size following a probabilistic approach, and is able to give more importance to the disagreements that appear at the top of the list. Moreover, it possesses desirable properties, including correction for chance, upper/lower bounds, and conditions for a deterministic selection. We illustrate the use of this stability metric on data generated in a fully controlled way and compare it with popular metrics, including Spearman's rank correlation and Kuncheva's index, on feature ranking and selection outcomes, respectively. Additionally, experimental validation of the proposed approach is carried out on a real-world food quality assessment problem, showing its potential to quantify stability from different perspectives. (C) 2020 Elsevier B.V. All rights reserved.
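The paper's exact formulation is not reproduced here, but the core idea of the abstract can be sketched as follows: map each ranking to a probability distribution over features (here, an assumed 1/rank weighting so that top positions carry more mass), then measure the generalized Jensen-Shannon divergence among the set of distributions and normalize it into a stability score in [0, 1]. The weighting scheme and normalization below are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def rank_to_distribution(ranking):
    # Convert a full ranking (a permutation of feature indices, best first)
    # into a probability distribution. The assumed 1/position weighting
    # gives more mass to top-ranked features, so disagreements at the top
    # of the list contribute more to the divergence.
    n = len(ranking)
    weights = np.zeros(n)
    for pos, feat in enumerate(ranking, start=1):
        weights[feat] = 1.0 / pos
    return weights / weights.sum()

def entropy(p):
    # Shannon entropy in bits, ignoring zero-probability entries.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def js_stability(rankings):
    # Generalized Jensen-Shannon divergence among m distributions:
    # JS = H(mean distribution) - mean of the individual entropies.
    # Stability is defined here as 1 - JS / JS_max, so a set of
    # identical rankings scores 1.0 (perfectly stable).
    dists = np.array([rank_to_distribution(r) for r in rankings])
    mean_dist = dists.mean(axis=0)
    js = entropy(mean_dist) - np.mean([entropy(d) for d in dists])
    js_max = np.log2(len(dists))  # upper bound for m distributions
    return 1.0 - js / js_max

# Identical rankings are perfectly stable; disagreeing rankings score lower.
print(js_stability([[0, 1, 2, 3], [0, 1, 2, 3]]))  # → 1.0
print(js_stability([[0, 1, 2, 3], [3, 2, 1, 0]]))  # < 1.0
```

Because the divergence is computed between distributions rather than by pairwise list comparison, the same machinery applies to feature subsets (uniform mass on the selected features) and to partial ranked lists, which is the generality the abstract highlights.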
