International Joint Conference on Natural Language Processing; Annual Meeting of the Association for Computational Linguistics

Cross-replication Reliability - An Empirical Approach to Interpreting Inter-rater Reliability

Abstract

When collecting annotations and labeled data from humans, a standard practice is to use inter-rater reliability (IRR) as a measure of data goodness (Hallgren, 2012). Metrics such as Krippendorff's alpha or Cohen's kappa are typically required to be above a threshold of 0.6 (Landis and Koch, 1977). These absolute thresholds are unreasonable for crowd-sourced data from annotators with high cultural and training variances, especially on subjective topics. We present a new alternative to interpreting IRR that is more empirical and contextualized. It is based upon benchmarking IRR against baseline measures in a replication, one of which is a novel cross-replication reliability (xRR) measure based on Cohen's (1960) kappa. We call this approach the xRR framework. We open-source a replication dataset of 4 million human judgements of facial expressions and analyze it with the proposed framework. We argue this framework can be used to measure the quality of crowdsourced datasets.
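As a concrete point of reference for the agreement statistics the abstract mentions, the minimal Python sketch below computes plain Cohen's (1960) kappa for two raters and compares it with the conventional 0.6 threshold of Landis and Koch (1977). It is an illustration only, not the paper's xRR measure, and the facial-expression labels are hypothetical.

# Minimal sketch: Cohen's kappa between two raters, checked against the
# conventional 0.6 threshold (Landis and Koch, 1977). This is the standard
# two-rater statistic, not the paper's cross-replication (xRR) measure.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's (1960) kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label marginals.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

if __name__ == "__main__":
    # Hypothetical facial-expression labels from two annotators.
    a = ["joy", "anger", "joy", "neutral", "joy", "sadness", "neutral", "joy"]
    b = ["joy", "anger", "neutral", "neutral", "joy", "anger", "neutral", "joy"]
    kappa = cohens_kappa(a, b)
    verdict = "meets the 0.6 threshold" if kappa >= 0.6 else "falls below the 0.6 threshold"
    print(f"kappa = {kappa:.3f} ({verdict})")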
