Journal: Journal of environment informatics

Supervised Machine Learning and Heuristic Algorithms for Outlier Detection in Irregular Spatiotemporal Datasets



Abstract

A central problem in time series analysis is the detection of outliers, and irregularly measured time series data with spatiotemporal components present further complications. This paper presents one heuristic and two supervised machine learning algorithms for detecting outliers in univariate time series data in this context, and compares their results with Chen and Liu's (1993) automatic outlier detection methodology. Large environmental databases that accept pollutant measurement data from virtually any source have recently been set up in many US states and around the world. To understand and explore the viability of such databases and of the proposed methods, these procedures are applied to measurements of various surface water pollutants in the California Environmental Data Exchange Network (CEDEN). The proposed methodologies, though not as robust, give results similar to those of existing methodologies given the nature of the data, and can be far less time-intensive to implement while still providing interesting insights into the database. The algorithms presented can therefore be used widely with minimal computing resources, and they give tractable results even on very large datasets. The methodologies apply in a variety of contexts and to a wide variety of databases with similar measurement challenges across many disciplines, specifically in the environmental setting. In particular, the results have large potential regulatory impact on accepted levels of different pollutants in California water bodies, as well as on the amounts to be charged for industrial discharge into those water bodies, and are intended to provide direction for further research and regulatory investments. Based on the results, it seems reasonable to assume that there is further room for including pollutant measurements from nongovernmental agencies in the debate on environmental pollution, specifically in California. However, the results also indicate that using such databases more inclusively for regulatory matters must be carefully evaluated on an individualized basis. This is to ensure that poorly collected or handled measurements do not inundate the database over and above those collected with more rigor, which could make inference on the true population distribution of the pollutants more difficult. This is especially relevant for pollutant measurements that require more delicate sampling procedures.
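The abstract does not describe the heuristic or supervised algorithms in detail, so the following is only a generic illustration of what a lightweight heuristic outlier check on irregularly sampled univariate data can look like. It is not the paper's method. It assumes a robust z-score (median and median absolute deviation) computed from neighbors inside a time window rather than a fixed number of samples, so that uneven spacing does not distort the comparison. The function name `flag_outliers`, its parameters, and the sample values are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from statistics import median


def flag_outliers(times, values, window=30.0, threshold=3.5, min_neighbors=3):
    """Return one boolean per point: True if the point looks like an outlier.

    Illustrative heuristic only (not the paper's algorithm).
    `times` must be sorted ascending; spacing may be irregular.
    A point is compared with the other points that fall within
    +/- `window` time units of it, using a robust z-score.
    """
    flags = []
    for i, (t, v) in enumerate(zip(times, values)):
        # Index range of observations inside the time window around t.
        lo = bisect_left(times, t - window)
        hi = bisect_right(times, t + window)
        # Exclude the point itself so it cannot mask its own deviation.
        neighbors = [values[j] for j in range(lo, hi) if j != i]
        if len(neighbors) < min_neighbors:
            flags.append(False)  # too little local context to judge
            continue
        med = median(neighbors)
        mad = median([abs(x - med) for x in neighbors])
        if mad == 0:
            # Assumption: with zero spread, any deviation counts as unusual.
            flags.append(v != med)
            continue
        # 0.6745 scales the MAD so the score is comparable to a z-score
        # under normality.
        robust_z = 0.6745 * (v - med) / mad
        flags.append(abs(robust_z) > threshold)
    return flags


# Made-up measurements at irregular times (e.g. days since first sample).
sample_times = [0, 1, 3, 4, 10, 11, 13, 40]
sample_values = [1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 1.3, 1.1]
sample_flags = flag_outliers(sample_times, sample_values, window=10)
```

With these made-up values, only the reading of 9.0 is flagged, and the isolated final reading is left unflagged because it has no neighbors within the window. A time-based window is a design choice that suits sparse, unevenly spaced submissions; a real analysis would also need to account for the spatial component, which this sketch ignores.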
