
Sembler: Ensembling Crowd Sequential Labeling for Improved Quality


Abstract

Many natural language processing tasks, such as named entity recognition (NER), part-of-speech (POS) tagging, and word segmentation, can be formulated as sequential data labeling problems. Building a sound labeler requires a very large number of correctly labeled training examples, which may not always be available. Crowdsourcing, on the other hand, provides an inexpensive yet efficient alternative for collecting manual sequential labels from non-experts. However, the quality of crowd labeling cannot be guaranteed, and three kinds of errors are typical: (1) incorrect annotations due to lack of expertise (e.g., labeling gene names in plain text requires the corresponding domain knowledge); (2) ignored or omitted annotations due to carelessness or low confidence; (3) noisy annotations due to cheating or vandalism. To correct these mistakes, we present Sembler, a statistical model for ensembling crowd sequential labelings. Sembler considers three types of statistical information: (1) the majority agreement that proves the correctness of an annotation; (2) correct annotations that improve the credibility of the corresponding annotator; (3) correct annotations that enhance the correctness of other annotations sharing similar linguistic or contextual features. We evaluate the proposed model on a real Twitter data set and a synthetic biological data set, and find that Sembler is particularly accurate when more than half of the annotators make mistakes.
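To make the first two kinds of information concrete, the following is a minimal sketch of token-level ensembling by reliability-weighted voting. It is not Sembler's model, whose details are not given in the abstract; it is a simpler stand-in technique that alternates between voting on each token and re-estimating each annotator's credibility from agreement with the current consensus. The function names, the smoothing constants, and the toy annotations are all assumptions made for illustration, and the third kind of information (shared linguistic or contextual features) is not modeled.

```python
from collections import defaultdict


def weighted_vote(annotations, weights):
    """Pick, for each token, the label with the highest total annotator weight.

    annotations: dict mapping annotator id -> list of labels (equal lengths).
    weights: dict mapping annotator id -> non-negative credibility weight.
    """
    n_tokens = len(next(iter(annotations.values())))
    consensus = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for annotator, labels in annotations.items():
            scores[labels[i]] += weights[annotator]
        # Sorting the candidate labels makes tie-breaking deterministic.
        consensus.append(max(sorted(scores), key=scores.get))
    return consensus


def ensemble(annotations, n_iters=5):
    """Alternate between voting and re-estimating annotator credibility."""
    weights = {annotator: 1.0 for annotator in annotations}
    consensus = weighted_vote(annotations, weights)  # plain majority vote
    for _ in range(n_iters):
        for annotator, labels in annotations.items():
            agree = sum(l == c for l, c in zip(labels, consensus))
            # Smoothed agreement rate with the current consensus
            # (add-one smoothing is an illustrative choice).
            weights[annotator] = (agree + 1) / (len(labels) + 2)
        consensus = weighted_vote(annotations, weights)
    return consensus, weights


# Hypothetical toy data: three annotators label "Obama visited Paris".
annotations = {
    "a1": ["PER", "O", "LOC"],
    "a2": ["PER", "O", "LOC"],
    "a3": ["O", "O", "O"],  # an annotator who omitted both entities
}
consensus, weights = ensemble(annotations)
```

Because this sketch starts from a plain majority vote, it inherits that vote's weakness: if more than half of the annotators are wrong on a token, the consensus is likely wrong too. That is the regime in which the abstract reports Sembler to be particularly accurate.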
