
Understanding complex predictive models with ghost variables




Within the literature on Interpretable Machine Learning, we propose a new procedure for assigning a measure of relevance to each explanatory variable in a complex predictive model. We assume that a training set is available to fit the model and a test set to check its out-of-sample performance. We propose to measure the individual relevance of each variable by comparing the model's predictions on the test set with those obtained when the variable of interest is replaced (in the test set) by its ghost variable, defined as the prediction of that variable from the remaining explanatory variables. For linear models it is shown that, on the one hand, the proposed measure gives results similar to leave-one-covariate-out (loco) at a lower computational cost and outperforms random permutations, and, on the other hand, it is strongly related to the usual F-statistic measuring the significance of a variable. In nonlinear predictive models (such as neural networks or random forests) the proposed measure captures the relevance of the variables efficiently, as shown by a simulation study comparing ghost variables with alternative methods (loco and random permutations, as well as knockoff variables and estimated conditional distributions). Finally, we study the joint relevance of the variables by defining the relevance matrix as the covariance matrix of the vectors of effects on predictions obtained when using each ghost variable. Our proposal is illustrated with simulated examples and the analysis of a large real data set.
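The procedure described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the simulated data, the helper names (`fit_ols`, `predict_ols`, `ghost_effects`), the use of ordinary least squares both for the predictive model and for the ghost variables, the choice to estimate each ghost variable on the test set, and the use of the unnormalized mean squared change in predictions as the relevance score are all assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) for y ~ X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predictions of a fitted OLS model."""
    return coef[0] + X @ coef[1:]

def ghost_effects(predict, X_test):
    """Column j: predict(X_test) minus predict(X_test with x_j -> its ghost).

    Assumption: the ghost of x_j is its OLS prediction from the other
    columns, fitted on the test set itself.
    """
    n, p = X_test.shape
    base = predict(X_test)
    effects = np.empty((n, p))
    for j in range(p):
        others = np.delete(X_test, j, axis=1)
        ghost = predict_ols(fit_ols(others, X_test[:, j]), others)
        X_ghost = X_test.copy()
        X_ghost[:, j] = ghost
        effects[:, j] = base - predict(X_ghost)
    return effects

# Hypothetical simulated data: x1 and x2 are correlated, x3 is irrelevant.
rng = np.random.default_rng(0)

def simulate(n):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=n)
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])
    y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)
    return X, y

X_tr, y_tr = simulate(500)   # training set: fit the predictive model
X_te, y_te = simulate(500)   # test set: measure relevance

coef = fit_ols(X_tr, y_tr)
predict = lambda X: predict_ols(coef, X)

effects = ghost_effects(predict, X_te)

# Individual relevance: mean squared change in the test-set predictions
# (unnormalized here; any scaling is a further assumption).
relevance = (effects ** 2).mean(axis=0)

# Joint relevance: covariance matrix of the effect vectors.
relevance_matrix = np.cov(effects, rowvar=False)

print("relevance:", relevance)
print("relevance matrix:\n", relevance_matrix)
```

Because `ghost_effects` only needs a `predict` callable, the same sketch applies to a nonlinear model such as a random forest or a neural network; the ghost-variable regression could likewise be swapped for a more flexible learner.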



