Source: International Journal of Statistics and Probability

A New Algorithm for Detecting Outliers in Linear Regression


Abstract

In this paper, we present a new algorithm for detecting multiple outliers in linear regression. The algorithm is based on a non-iterative robust covariance matrix and on the concentration steps used in least trimmed squares (LTS) estimation. A robust covariance matrix is constructed to calculate Mahalanobis distances of the independent variables, which are then used as weights in weighted least squares estimation. A few concentration steps are then performed using the observations that have the smallest residuals. We generate random data sets for $n=10^3, 10^4, 10^5$ and $p=5,10$ to demonstrate the capabilities of the algorithm. Our Monte Carlo simulations show that the algorithm has very low masking and swamping ratios when the number of observations is up to $10^4$ in the case of maximum contamination in X-space. They also show that the algorithm is successful in the case of Y-space outliers when the contamination level, sample size, and number of parameters are up to $30\%$, $n=10^5$, and $p=10$, respectively. Bias, variance, and MSE statistics are calculated for different scenarios. The reported computation time of our implementation is quite short. We conclude that the presented algorithm is suitable for detecting multiple outliers in regression analysis, given its small masking and swamping ratios, its accurate estimates of the regression parameters except the intercept, and its short computation time on large data sets with a high level of contamination. Future work is required to reduce the bias and variance of the intercept estimator in the model.
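The pipeline described in the abstract (robust distances in X-space, a weighted least squares start, then a few concentration steps) can be illustrated with a rough sketch. This is not the authors' implementation. The abstract does not specify the non-iterative robust covariance estimator, the weight function, the subset size, or the flagging rule, so everything below is an assumption chosen for illustration: `robust_cov_standin` is a hypothetical stand-in (classical covariance of the half of the points nearest the coordinate-wise median), the weights are taken as $1/(1+d^2)$, the subset size `h` follows a common LTS convention, and residuals are flagged against a MAD-based scale.

```python
import numpy as np

def robust_cov_standin(X):
    """Hypothetical stand-in for the paper's robust covariance (an assumption)."""
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)
    # Distance of each row to the coordinate-wise median, in MAD units
    d = np.sqrt((((X - med) / mad) ** 2).sum(axis=1))
    core = X[np.argsort(d)[: X.shape[0] // 2 + 1]]  # central half of the data
    return core.mean(axis=0), np.atleast_2d(np.cov(core, rowvar=False))

def mahalanobis_sq(X, loc, cov):
    """Squared Mahalanobis distance of each row of X from loc."""
    diff = X - loc
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)

def detect_outliers(X, y, n_csteps=10, cutoff=2.5):
    """Sketch: WLS start from X-space distances, then LTS-style C-steps."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept
    h = (n + p + 2) // 2                  # assumed subset size (p + 1 coefficients)

    # Step 1: robust distances in X-space, turned into weights (assumed form)
    loc, cov = robust_cov_standin(X)
    w = 1.0 / (1.0 + mahalanobis_sq(X, loc, cov))

    # Step 2: weighted least squares as the starting fit
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]

    # Step 3: concentration steps, refitting OLS on the h smallest residuals
    for _ in range(n_csteps):
        r2 = (y - A @ beta) ** 2
        subset = np.argsort(r2)[:h]
        beta = np.linalg.lstsq(A[subset], y[subset], rcond=None)[0]

    # Step 4: flag large residuals relative to a MAD-type scale (assumed rule)
    resid = y - A @ beta
    scale = 1.4826 * np.median(np.abs(resid))
    return beta, np.abs(resid) > cutoff * scale
```

Calling `detect_outliers(X, y)` on an `(n, p)` array `X` and a length-`n` response `y` returns the coefficient vector (intercept first) and a boolean mask of flagged observations. The number of concentration steps and the cutoff of 2.5 are illustrative defaults, not values taken from the paper.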
