EBOD: An ensemble-based outlier detection algorithm for noisy datasets

Abstract

Real-world datasets often comprise outliers (e.g., due to operational error, intrinsic variability of the measurements, recording mistakes, etc.) and, hence, require cleansing as a prerequisite to any meaningful machine learning analysis. However, data cleansing is often a laborious task that requires intuition or expert knowledge. In particular, selecting an outlier detection algorithm is challenging as this choice is dataset-specific and depends on the nature of the considered dataset. These difficulties have prevented the development of a "one-fits-all" approach for the cleansing of real-world, noisy datasets. Here, we present an unsupervised, ensemble-based outlier detection (EBOD) approach that considers the union of different outlier detection algorithms, wherein each of the selected detectors is only responsible for identifying a small number of outliers that are the most obvious from their respective standpoints. The use of an ensemble of weak detectors reduces the risk of bias during outlier detection as compared to using a single detector. The optimal combination of detectors is determined by forward-backward search. By taking the example of a noisy dataset of concrete strength measurements as well as a broad collection of benchmark datasets, we demonstrate that our EBOD method systematically outperforms all alternative detectors, when used individually or in combination. Based on this new outlier detection method, we explore how data cleansing affects the complexity, training, and accuracy of an artificial neural network. © 2021 The Authors. Published by Elsevier B.V.
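
The two ideas named in the abstract, a union of weak detectors and a forward-backward search over detector combinations, can be illustrated with a minimal sketch. This is not the authors' implementation: the three one-dimensional detectors (`zscore_scores`, `mad_scores`, `knn_scores`), the `fraction` parameter, and the `objective` callable are all assumptions made for illustration, and the abstract does not state which detectors or which selection criterion the paper uses.

```python
from statistics import mean, median, pstdev


def zscore_scores(values):
    """Distance from the mean in units of standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) / sigma if sigma else 0.0 for v in values]


def mad_scores(values):
    """Distance from the median in units of median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [abs(v - med) / mad if mad else 0.0 for v in values]


def knn_scores(values, k=2):
    """Distance to the k-th nearest neighbour (1-D toy version)."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical detector pool; the paper's actual pool is not given in the abstract.
DETECTORS = {"zscore": zscore_scores, "mad": mad_scores, "knn": knn_scores}


def top_fraction(scores, fraction):
    """Indices of the highest-scoring points; each detector flags only a few."""
    n_flag = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_flag])


def union_ensemble(values, names, fraction=0.2):
    """Union of the points flagged by each named detector."""
    flagged = set()
    for name in names:
        flagged |= top_fraction(DETECTORS[name](values), fraction)
    return flagged


def _best_move(trials, objective, best):
    """Best strictly improving (score, subset) among the trials, or None."""
    scored = [(objective(t), t) for t in trials]
    scored = [pair for pair in scored if pair[0] > best]
    return max(scored, key=lambda pair: pair[0]) if scored else None


def forward_backward_search(names, objective):
    """Greedy search: alternate adding and dropping one detector at a time.

    `objective(subset)` is an assumed user-supplied score (higher is better).
    """
    selected, best = [], float("-inf")
    changed = True
    while changed:
        changed = False
        # Forward step: add the detector that improves the objective most.
        additions = [selected + [n] for n in names if n not in selected]
        move = _best_move(additions, objective, best)
        if move:
            (best, selected), changed = move, True
        # Backward step: drop a detector if doing so improves the objective.
        if len(selected) > 1:
            removals = [[m for m in selected if m != n] for n in selected]
            move = _best_move(removals, objective, best)
            if move:
                (best, selected), changed = move, True
    return selected, best


values = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 9.7, 10.2, 10.1, 25.0, -4.0]
flagged = union_ensemble(values, ["zscore", "mad", "knn"])
print(sorted(flagged))  # [10, 11]
```

Because each detector contributes only its few highest-scoring points, the union stays small even as detectors are added. In practice the `objective` passed to `forward_backward_search` would have to be some measure of cleansing quality, for example the validation accuracy of a model trained on the cleansed data; that choice is an assumption here, not something the abstract specifies.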

Bibliographic details

  • Source
    Knowledge-Based Systems | 2021, Issue 14 | 107400.1-107400.16 | 16 pages
  • Author affiliations

    Univ Calif Los Angeles Phys Amorphous & Inorgan Solids Lab Parislab Dept Civil & Environm Engn Los Angeles CA 90024 USA|Univ Calif Los Angeles Dept Civil & Environm Engn Lab Chem Construct Mat LC2 Los Angeles CA USA;

    Univ Calif Los Angeles Phys Amorphous & Inorgan Solids Lab Parislab Dept Civil & Environm Engn Los Angeles CA 90024 USA;

    Univ Calif Los Angeles Phys Amorphous & Inorgan Solids Lab Parislab Dept Civil & Environm Engn Los Angeles CA 90024 USA;

    Univ Calif Los Angeles Dept Civil & Environm Engn Lab Chem Construct Mat LC2 Los Angeles CA USA|Univ Calif Los Angeles Inst Carbon Management ICM Los Angeles CA 90024 USA|Univ Calif Los Angeles Dept Mat Sci & Engn Los Angeles CA 90024 USA|Univ Calif Los Angeles Calif Nanosyst Inst Los Angeles CA 90024 USA;

    Univ Calif Los Angeles Phys Amorphous & Inorgan Solids Lab Parislab Dept Civil & Environm Engn Los Angeles CA 90024 USA|Univ Calif Los Angeles Inst Carbon Management ICM Los Angeles CA 90024 USA;

  • Original format: PDF
  • Text language: English
  • Keywords

    Outlier detection; Data cleansing; Machine learning; Concrete strength;
