Understanding complex predictive models with ghost variables

Delicado Pedro; Pena Daniel

Abstract

Within the literature on interpretable machine learning, we propose a new procedure for assigning a measure of relevance to each explanatory variable in a complex predictive model. We assume that a training set is available to fit the model and a test set to check its out-of-sample performance. We propose to measure the individual relevance of each variable by comparing the model's predictions on the test set with those obtained when the variable of interest is replaced (in the test set) by its ghost variable, defined as the prediction of this variable from the remaining explanatory variables. In linear models it is shown that, on the one hand, the proposed measure gives results similar to leave-one-covariate-out (LOCO) at a lower computational cost and outperforms random permutations and, on the other hand, it is strongly related to the usual F-statistic measuring the significance of a variable. In nonlinear predictive models (such as neural networks or random forests) the proposed measure shows the relevance of the variables in an efficient way, as shown by a simulation study comparing ghost variables with alternative methods (including LOCO and random permutations, as well as knockoff variables and estimated conditional distributions). Finally, we study the joint relevance of the variables by defining the relevance matrix as the covariance matrix of the vectors of effects on predictions when using every ghost variable. Our proposal is illustrated with simulated examples and the analysis of a large real data set.
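
The procedure described in the abstract can be illustrated with a minimal sketch. It assumes a linear predictive model and linear ghost-variable regressions fitted on the test set; both are illustrative choices made here, not details taken from the paper. The function names fit_linear and ghost_effects, the simulated data, and the unnormalized mean-squared-difference relevance score are likewise hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a prediction function."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def ghost_effects(predict, X_test):
    """Column j holds f(X) - f(X with column j replaced by its ghost variable)."""
    n, p = X_test.shape
    base = predict(X_test)
    effects = np.empty((n, p))
    for j in range(p):
        others = np.delete(X_test, j, axis=1)
        # Ghost variable: prediction of column j from the remaining columns
        # (assumption: a linear regression fitted on the test set itself).
        ghost_j = fit_linear(others, X_test[:, j])(others)
        X_ghost = X_test.copy()
        X_ghost[:, j] = ghost_j
        effects[:, j] = base - predict(X_ghost)
    return effects

# Simulated data (hypothetical): x2 is strongly correlated with x1
# but has no effect of its own on y.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

X_train, y_train = X[:1000], y[:1000]
X_test = X[1000:]

predict = fit_linear(X_train, y_train)
effects = ghost_effects(predict, X_test)

# Individual relevance: mean squared change in the test-set predictions
# (unnormalized here; any scaling is left out of this sketch).
relevance = (effects ** 2).mean(axis=0)

# Joint relevance: covariance matrix of the effect vectors.
relevance_matrix = np.cov(effects, rowvar=False)

print("relevance:", relevance)
print("relevance matrix:\n", relevance_matrix)
```

In this simulated setting x2 should receive a much smaller relevance than x1 or x3, because replacing it by its ghost variable barely changes the predictions. Any other fitted model can be passed in place of predict, as long as it maps a test matrix to a vector of predictions.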


Bibliographic details

Source
Test: An Official Journal of the Spanish Society of Statistics and Operations Research | 2023, Issue 1 | pp. 107-145 | 39 pages
Authors
Delicado Pedro; Pena Daniel
Affiliations

Dept Estadist & Invest Operat;

Getafe;

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Metrology; Probability theory and mathematical statistics
Keywords
Explainable artificial intelligence; Estimated conditional distributions; Interpretable machine learning; Knockoffs; Leave-one-covariate-out; Out-of-sample prediction; Partial correlation matrix; Random permutations
