Journal of Applied Statistics

Evaluation of robust outlier detection methods for zero-inflated complex data


Abstract

Outlier detection can be seen as a pre-processing step for locating data points in a sample that do not conform to the majority of observations. Various techniques and methods for outlier detection can be found in the literature dealing with different types of data. However, many data sets are inflated by true zeros and, in addition, some components/variables might be of compositional nature. Important examples of such data sets are the Structural Earnings Survey, the Structural Business Statistics, the European Statistics on Income and Living Conditions, tax data or, as in this contribution, household expenditure data, which are used, for example, to estimate the Purchasing Power Parity of a country.

In this work, robust univariate and multivariate outlier detection methods are compared in a complex simulation study that considers various challenges present in such data sets, namely structural (true) zeros, missing values, and compositional variables. These circumstances make it difficult or impossible to flag true outliers and influential observations with well-known outlier detection methods.

Our aim is to assess the performance of outlier detection methods in terms of how effectively they identify outliers when applied to challenging data sets such as the household expenditure data collected in surveys all over the world. Moreover, different methods are evaluated through a close-to-reality simulation study. Differences in performance of univariate and multivariate robust techniques for outlier detection and their shortcomings are reported. We found that robust multivariate methods outperform robust univariate methods. The methods that performed best in finding the outliers while keeping the false discovery rate low were the generalized S estimators (GSE), the BACON-EEM algorithm and a compositional method (CoDa-Cov). In addition, these methods also performed best when the outliers are imputed based on the corresponding outlier detection method and indicators are estimated from the data sets.
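To make the evaluation criteria concrete, the following is a minimal sketch of a robust univariate baseline together with the two quantities the abstract refers to: the share of true outliers found and the false discovery rate. It is not one of the methods named above (GSE, BACON-EEM, CoDa-Cov) and is not taken from the paper. The median/MAD rule on log-transformed positive values, the cutoff of 3, the function names and the toy expenditure figures are all assumptions made for illustration.

```python
import math
import statistics


def flag_outliers_nonzero(values, cutoff=3.0):
    """Flag outliers among the strictly positive entries of `values`.

    Assumption for this sketch: zeros are structural (true) zeros and are
    never flagged. Positive values are log-transformed and compared with
    their median, scaled by the MAD (times 1.4826, the usual consistency
    factor under normality).
    """
    logs = [math.log(v) for v in values if v > 0]
    if not logs:
        return [False] * len(values)
    center = statistics.median(logs)
    mad = 1.4826 * statistics.median(abs(x - center) for x in logs)
    flags = []
    for v in values:
        if v <= 0 or mad == 0:
            flags.append(False)
        else:
            flags.append(abs(math.log(v) - center) / mad > cutoff)
    return flags


def detection_and_fdr(flags, truth):
    """Return (share of true outliers flagged, false discovery rate)."""
    tp = sum(f and t for f, t in zip(flags, truth))
    fp = sum(f and not t for f, t in zip(flags, truth))
    n_true = sum(truth)
    n_flagged = tp + fp
    detection_rate = tp / n_true if n_true else 0.0
    fdr = fp / n_flagged if n_flagged else 0.0
    return detection_rate, fdr


# Hypothetical expenditures with structural zeros and one planted outlier.
expenditures = [0, 0, 120, 95, 0, 110, 130, 105, 0, 100, 9000, 115]
truth = [v == 9000 for v in expenditures]

flags = flag_outliers_nonzero(expenditures)
rates = detection_and_fdr(flags, truth)
print(flags, rates)
```

A rule like this looks at one variable at a time, so it cannot use the relationships between expenditure components; the abstract reports that robust multivariate methods outperform robust univariate ones.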
