Journal: Environment systems & decisions

Active learning in automated text classification: a case study exploring bias in predicted model performance metrics



Abstract

Machine learning has emerged as a cost-effective innovation to support systematic literature reviews in human health risk assessments and other contexts. Supervised machine learning approaches rely on a training dataset, a relatively small set of documents with human-annotated labels indicating their topic, to build models that automatically classify a larger set of unclassified documents. "Active" machine learning has been proposed as an approach that limits the cost of creating a training dataset by interactively and sequentially focussing on training only the most informative documents. We simulate active learning using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. The dataset was previously annotated by subject matter experts with regard to relevance to two topics relating to toxicology and risk assessment. We examine the performance of alternative sampling approaches to sequentially expanding the training dataset, specifically looking at uncertainty-based sampling and probability-based sampling. We discover that while such active learning methods can potentially reduce training dataset size compared to random sampling, predictions of model performance in active learning are likely to suffer from statistical bias that negates the method's potential benefits. We discuss approaches and the extent to which the bias resulting from skewed sampling can be compensated. We propose a useful role for active learning in contexts in which the accuracy of model performance metrics is not critical and/or where it is beneficial to rapidly create a class-balanced training dataset.
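The abstract describes sequentially expanding the training dataset by querying the documents the current model is least certain about. The sketch below is a minimal, self-contained illustration of such an uncertainty-based sampling loop, not the authors' actual pipeline: it assumes a scikit-learn TF-IDF representation with a logistic regression classifier, a small synthetic placeholder corpus standing in for the roughly 7000 expert-annotated arsenic abstracts, and an arbitrary batch size, with the known synthetic labels playing the role of the human annotator.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Placeholder corpus: synthetic stand-ins for the expert-annotated abstracts.
ys = rng.integers(0, 2, size=500)
docs = [
    ("abstract on arsenic toxicity and risk assessment" if y
     else "abstract on unrelated analytical chemistry") + f" record {i}"
    for i, y in enumerate(ys)
]
labels = np.asarray(ys)

X = TfidfVectorizer().fit_transform(docs)

# Seed the training set with a small random sample of labeled documents.
labeled = [int(i) for i in rng.choice(len(docs), size=20, replace=False)]
unlabeled = [i for i in range(len(docs)) if i not in labeled]

batch_size = 20
for step in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], labels[labeled])

    # Uncertainty-based sampling: query the unlabeled documents whose
    # predicted probability of relevance is closest to 0.5.
    probs = clf.predict_proba(X[unlabeled])[:, 1]
    query = [unlabeled[j] for j in np.argsort(np.abs(probs - 0.5))[:batch_size]]

    # "Annotate" the queried documents (the known labels act as the oracle)
    # and move them into the training set before the next retraining round.
    labeled.extend(query)
    unlabeled = [i for i in unlabeled if i not in query]

    # Track performance on the remaining unlabeled pool.
    pool_f1 = f1_score(labels[unlabeled], clf.predict(X[unlabeled]))
    print(f"round {step}: training size={len(labeled)}, pool F1={pool_f1:.2f}")
```

Note that the evaluation in the last lines is performed on a pool from which the most informative (hardest) documents have been preferentially removed, which is one way the skewed-sampling bias discussed in the abstract can distort predicted performance metrics.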

