Journal: Accident Analysis and Prevention

Effectiveness of resampling methods in coping with imbalanced crash data: Crash type analysis and predictive modeling


Abstract

Crash data analysis is commonly subject to imbalanced data. Depending on facility and control types, some crash types are more frequent than others. However, uncommon crash types are routinely more severe and associated with higher economic and societal costs, and are thus crucial to prevent. It is paramount to develop inferential models that can reliably predict crash types and identify contributing factors, especially for the severe types. The current process of modeling infrequent events generally disregards disparity in data representation, which can lead to biased models. Therefore, mitigating and managing imbalanced data is essential to the development of meaningful and robust models that help reveal effective countermeasures. This study compares the effects of resampling techniques on the performance of both machine learning and classical statistical models for classifying and predicting different crash types on freeways. Specifically, a mixed sampling approach featuring cluster-based under-sampling coupled with three popular over-sampling methods (i.e., random over-sampling, synthetic minority over-sampling, and adaptive synthetic sampling) was investigated with respect to four crash classification models: three ensemble machine learning models (CatBoost, XGBoost, and Random Forests) and one classical statistical model (Nested Logit). The study concluded that all three resampling methods consistently enhanced the performance of all models. Among the three over-sampling methods, the adaptive synthetic sampling approach performed best and greatly improved the prediction of minority crash types without impeding the prediction of the majority crash type. This is likely due to the density-based approach of adaptive synthetic sampling, which creates synthetic instances that are more congruent with the underlying manifold structure embodied in the high-dimensional feature space.
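To make the over-sampling ideas named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `random_oversample` and `smote_like`, the toy feature rows, and the labels "majority" and "minority" are all assumptions made for illustration. The first function duplicates minority rows at random (random over-sampling); the second interpolates between a minority row and one of its nearest minority neighbours, which is the core idea behind synthetic minority over-sampling. Adaptive synthetic sampling goes one step further by generating more synthetic rows around minority points that are surrounded by majority points; that weighting is not shown here.

```python
import random
from collections import Counter, defaultdict


def random_oversample(X, y, seed=0):
    """Duplicate minority rows at random until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Needs at least two rows of the minority class.
    """
    rng = random.Random(seed)
    rows = [r for r, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        others = [r for r in rows if r is not base]
        # k nearest minority neighbours by squared Euclidean distance
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(r, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical, not from the study): six majority rows, two minority rows.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.2], [0.1, 0.1],
     [2.0, 2.1], [2.2, 1.9]]
y = ["majority"] * 6 + ["minority"] * 2

X_ros, y_ros = random_oversample(X, y)
new_rows = smote_like(X, y, minority="minority", n_new=4)
print(Counter(y_ros))
print(new_rows)
```

In practice a maintained library would normally be used instead of hand-written resampling, and resampling should be applied to the training split only so that evaluation data keeps its original class balance.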
