Workshop on Evaluation and Comparison of NLP Systems

Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models



Abstract

In pursuit of the perfect supervised NLP classifier, razor-thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient because they fail to give a complete picture of the model's behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1), respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model's confidence score assignments have an impact on the system's behavior. We describe four key benefits of our proposed metrics compared to their threshold-based counterparts. Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments, are self-evident from the metrics' definitions; the remaining benefits, generalization and functional consistency, are demonstrated empirically.
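The abstract's core idea can be sketched in code. The sketch below is one plausible reading of a confidence-weighted Precision/Recall/F1 (not the paper's exact formulation, which the abstract does not spell out): each counted example in the standard threshold-based metrics is replaced by the model's confidence in the label it predicted. The function name `confidence_prf1` and the binary-classification framing are illustrative assumptions.

```python
# Illustrative sketch of confidence-weighted metrics (cPrecision, cRecall, cF1).
# Assumption: each unit count (TP, FP, FN) is replaced by the model's
# confidence in the label it predicted for that example.

def confidence_prf1(probs, y_true, threshold=0.5):
    """probs: predicted P(positive) per example; y_true: gold binary labels."""
    ctp = cfp = cfn = 0.0
    for p, y in zip(probs, y_true):
        pred = 1 if p >= threshold else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted label
        if pred == 1 and y == 1:
            ctp += conf   # a confident correct positive counts for more
        elif pred == 1 and y == 0:
            cfp += conf   # a confident false positive hurts more
        elif pred == 0 and y == 1:
            cfn += conf   # a confident miss hurts more
    c_precision = ctp / (ctp + cfp) if ctp + cfp else 0.0
    c_recall = ctp / (ctp + cfn) if ctp + cfn else 0.0
    c_f1 = (2 * c_precision * c_recall / (c_precision + c_recall)
            if c_precision + c_recall else 0.0)
    return c_precision, c_recall, c_f1
```

Under this reading, two models with identical thresholded predictions but different confidence assignments receive different scores, which is the "sensitivity to model confidence score assignments" the abstract names as a benefit over threshold-based F1.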
