Identifying Statistical Bias in Dataset Replication

Abstract

Dataset replication is a useful tool for assessing whether improvements in test accuracy on a specific benchmark correspond to improvements in models' ability to generalize reliably. In this work, we present unintuitive yet significant ways in which standard approaches to dataset replication introduce statistical bias, skewing the resulting observations. We study ImageNet-v2, a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy, even after controlling for selection frequency, a human-in-the-loop measure of data quality. We show that after remeasuring selection frequencies and correcting for statistical bias, only an estimated 3.6% ± 1.5% of the original 11.7% ± 1.0% accuracy drop remains unaccounted for. We conclude with concrete recommendations for recognizing and avoiding bias in dataset replication. Code for our study is publicly available.
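The abstract does not spell out how the bias arises, so the following is only a toy sketch of one generic mechanism: choosing images by thresholding a noisy, finite-sample estimate of selection frequency. It is not the authors' code or analysis. The uniform distribution of true selection frequencies, the 10 annotators per image, the 7-vote cutoff, and the function name `simulate` are all assumptions made for illustration.

```python
import random


def simulate(num_images=20000, num_annotators=10, min_votes=7, seed=0):
    """Toy model (all parameters are assumptions, not values from the paper).

    Each image has a hidden true selection frequency. We only observe the
    fraction of a few annotators who selected it, and we keep images whose
    observed fraction clears a cutoff.
    """
    rng = random.Random(seed)
    observed_kept = []
    true_kept = []
    for _ in range(num_images):
        # Assumed prior: true selection frequency is uniform on [0, 1].
        true_freq = rng.random()
        # Each annotator independently selects the image with prob. true_freq.
        votes = sum(rng.random() < true_freq for _ in range(num_annotators))
        if votes >= min_votes:
            observed_kept.append(votes / num_annotators)
            true_kept.append(true_freq)
    mean_observed = sum(observed_kept) / len(observed_kept)
    mean_true = sum(true_kept) / len(true_kept)
    return mean_observed, mean_true


if __name__ == "__main__":
    mean_observed, mean_true = simulate()
    print(f"mean observed selection frequency of kept images: {mean_observed:.3f}")
    print(f"mean true selection frequency of kept images:     {mean_true:.3f}")
```

In this toy setup the kept images look better on the observed statistic than they truly are, because images that cleared the cutoff partly did so through favorable annotator noise. Matching two datasets on such a noisy statistic can therefore leave a real quality gap even when the observed statistics agree.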

Bibliographic record

Source: International Conference on Machine Learning, 2021, pp. 2344-3125 (11 pages)
Authors: Logan Engstrom; Andrew Ilyas; Shibani Santurkar; Dimitris Tsipras; Jacob Steinhardt; Aleksander Madry
Original format: PDF
Chinese Library Classification: TP181-53

