ISPRS Journal of Photogrammetry and Remote Sensing

Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin



Abstract

Studies have demonstrated the robust performance of the ensemble machine learning classifier, random forests, for remote sensing land cover classification, particularly across complex landscapes. This study introduces new ensemble margin criteria to evaluate the performance of Random Forests (RF) in the context of large area land cover classification and examines the effect of different training data characteristics (imbalance and mislabelling) on classification accuracy and uncertainty. The study presents a new margin weighted confusion matrix which, used in combination with the traditional confusion matrix, provides confidence estimates associated with correctly classified and misclassified instances in the RF classification model. Landsat TM satellite imagery and topographic and climate ancillary data are used to build binary (forest/non-forest) and multiclass (forest canopy cover classes) classification models, trained using sample aerial photograph maps, across Victoria, Australia. Experiments were undertaken to reveal insights into the behaviour of RF over large and complex data, in which training data are not evenly distributed among classes (imbalance) and contain systematically mislabelled instances. Results of the experiments reveal that while the error rate of the RF classifier is relatively insensitive to mislabelled training data (in the multiclass experiment, overall Kappa falls from 78.3% with no mislabelled instances to 70.1% with 25% mislabelling in each class), the level of associated confidence falls at a faster rate than overall accuracy with increasing amounts of mislabelled training data. In general, balanced training data resulted in the lowest overall error rates for classification experiments (82.3% and 78.3% for the binary and multiclass experiments respectively). However, results of the study demonstrate that imbalance can be introduced to improve error rates of more difficult classes, without adversely affecting overall classification accuracy.
(C) 2015 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
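
The abstract does not spell out how the ensemble margin is computed. As a rough illustration only, the sketch below uses one common supervised definition, assumed here rather than taken from the paper: for each instance, the number of tree votes for the true class minus the largest number of votes for any other class, divided by the total number of votes. The function name `ensemble_margin` and the vote counts are hypothetical.

```python
import numpy as np

def ensemble_margin(votes, y_true):
    """Per-instance supervised margin in [-1, 1].

    Assumed definition (not confirmed by the abstract):
    (votes for true class - max votes for any other class) / total votes.
    votes:  (n_instances, n_classes) array of tree vote counts.
    y_true: (n_instances,) array of true class indices.
    """
    votes = np.asarray(votes, dtype=float)
    y_true = np.asarray(y_true)
    rows = np.arange(votes.shape[0])
    total = votes.sum(axis=1)
    v_true = votes[rows, y_true]
    # Mask out the true class so the max runs over the other classes only.
    others = votes.copy()
    others[rows, y_true] = -np.inf
    v_other = others.max(axis=1)
    return (v_true - v_other) / total

# Hypothetical vote counts from a 100-tree forest over three classes.
votes = np.array([[90, 10, 0],    # confident and correct
                  [40, 35, 25],   # correct, but low confidence
                  [20, 70, 10]])  # misclassified
y_true = np.array([0, 0, 0])
margins = ensemble_margin(votes, y_true)
print(margins)
```

Under this definition a margin near 1 indicates a confident correct classification, a margin near 0 an uncertain one, and a negative margin a misclassification, which is the kind of confidence information the abstract associates with correctly classified and misclassified instances.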
