International Conference on Machine Learning

Identifying Statistical Bias in Dataset Replication

Abstract

Dataset replication is a useful tool for assessing whether improvements in test accuracy on a specific benchmark correspond to improvements in models' ability to generalize reliably. In this work, we present unintuitive yet significant ways in which standard approaches to dataset replication introduce statistical bias, skewing the resulting observations. We study ImageNet-v2, a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy, even after controlling for selection frequency, a human-in-the-loop measure of data quality. We show that after remeasuring selection frequencies and correcting for statistical bias, only an estimated 3.6% ± 1.5% of the original 11.7% ± 1.0% accuracy drop remains unaccounted for. We conclude with concrete recommendations for recognizing and avoiding bias in dataset replication. Code for our study is publicly available.
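To make the idea of statistical bias in selection-frequency matching concrete, here is a toy simulation. It is a sketch under stated assumptions, not the study's code or method. It assumes that one source of bias is noise in the selection-frequency estimates: each image has a hypothetical true selection frequency drawn from a Beta(2, 2) distribution, a small panel of 10 annotators yields a noisy estimate, and images are kept when that estimate reaches a threshold of 0.7. The distribution, panel size, threshold, and the function name `simulate` are all illustrative choices that do not come from the abstract.

```python
import random


def simulate(num_images=20000, annotators=10, threshold=0.7, seed=0):
    """Toy model: filter images on a noisy selection-frequency estimate.

    Every parameter here is an illustrative assumption, not a value
    taken from the study.
    """
    rng = random.Random(seed)
    observed, true = [], []
    for _ in range(num_images):
        # Hypothetical "true" selection frequency of one image.
        s = rng.betavariate(2.0, 2.0)
        # Each annotator independently selects the image with probability s.
        votes = sum(rng.random() < s for _ in range(annotators))
        s_hat = votes / annotators
        # Keep the image only if its *estimated* frequency passes the filter.
        if s_hat >= threshold:
            observed.append(s_hat)
            true.append(s)
    mean_observed = sum(observed) / len(observed)
    mean_true = sum(true) / len(true)
    return mean_observed, mean_true


obs_mean, true_mean = simulate()
print(f"mean estimated selection frequency of kept images: {obs_mean:.3f}")
print(f"mean true selection frequency of kept images:      {true_mean:.3f}")
```

In this toy model the kept images look better by their estimated frequency than they are by their true frequency, because thresholding on a noisy estimate favors images whose noise happened to be positive. A replication pipeline that matches datasets on such estimates could therefore end up with a harder dataset than intended, which is the kind of effect the abstract describes correcting for.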
