Journal: Neurocomputing

Biases in feature selection with missing data


Abstract

Feature selection matters in two scenarios: (1) prediction, i.e., improving (or minimally degrading) the predictions of a target variable while discarding redundant or uninformative features, and (2) discovery, i.e., identifying features that truly depend on the target and may be genuine causes, to be confirmed by experimental verification (for example, drug target discovery in genomics). In both cases, if variables have many missing values, imputing them may lead to false positives: features that are not associated with the target become dependent on it as a result of imputation. In the first scenario this may not harm prediction, but in the second it leads to the erroneous selection of irrelevant features. In this paper, we study the risk/benefit trade-off of missing value imputation in the context of feature selection, using causal graphs to characterize when structural bias arises. We also investigate situations in which imputing missing values may help reduce false negatives, which can occur when a feature and the target are dependent but the dependency does not reach statistical significance when only complete cases are considered. However, the benefit of reducing false negatives must be balanced against the increased number of false positives. For a binary target variable and continuous features, the t-test is often used for univariate feature selection. We also introduce a de-biased version of the t-test that allows us to reap the benefits of imputation without incurring the penalty of more false positives. © 2019 Elsevier B.V. All rights reserved.
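The false-positive mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's experiment and does not implement its de-biased t-test; it is a hypothetical setup in which a feature is independent of a binary target, values go missing only in the positive class, and the gaps are filled with a constant. The function names (`welch_t`, `simulate`), the sample sizes, the fill value, and the use of a large-sample critical value of 1.96 are all assumptions made for illustration.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )


def simulate(n_trials=300, n_per_class=200, miss_rate_pos=0.5,
             fill_value=0.0, seed=0):
    """Return (false-positive rate on complete cases,
               false-positive rate after constant imputation)."""
    rng = random.Random(seed)
    crit = 1.96  # assumed large-sample two-sided 5% threshold (normal approx.)
    fp_complete = 0
    fp_imputed = 0
    for _ in range(n_trials):
        # Same distribution in both classes: no true feature/target dependency.
        neg = [rng.gauss(3.0, 1.0) for _ in range(n_per_class)]
        pos = [rng.gauss(3.0, 1.0) for _ in range(n_per_class)]
        # Missingness depends on the target: only the positive class loses values.
        observed = [rng.random() >= miss_rate_pos for _ in pos]
        pos_complete = [x for x, keep in zip(pos, observed) if keep]
        pos_imputed = [x if keep else fill_value for x, keep in zip(pos, observed)]
        fp_complete += abs(welch_t(neg, pos_complete)) > crit
        fp_imputed += abs(welch_t(neg, pos_imputed)) > crit
    return fp_complete / n_trials, fp_imputed / n_trials


if __name__ == "__main__":
    rate_complete, rate_imputed = simulate()
    print(f"complete cases: {rate_complete:.2f}, imputed: {rate_imputed:.2f}")
```

In this setup the complete-case test should reject at roughly the nominal rate, while the imputed test rejects almost always, because the fill value pulls the positive-class mean away from the negative-class mean. Other imputation schemes and missingness patterns would behave differently, so the numbers here say nothing about the paper's own results.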
