Journal: Intelligent Data Analysis

A two-phase hybrid of semi-supervised and active learning approach for sequence labeling

Abstract

In recent years, many NLP systems have been developed using machine learning methods. To achieve the best performance, these systems are generally trained on a large human-annotated corpus. Because annotating such corpora is expensive and time-consuming, manual annotation has become a significant issue in many text-based tasks such as text mining, semantic annotation, Named Entity Recognition, and Information Extraction in general. Semi-supervised learning and active learning are two distinct approaches to reducing labeling costs. By their nature, they can produce better results when applied jointly. In this paper we propose a combined semi-supervised and active learning approach for sequence labeling that greatly reduces manual annotation cost: only highly uncertain tokens need to be labeled manually, while other sequences and subsequences are labeled automatically. The proposed approach reduces manual annotation cost by around 90% compared with supervised learning and by 30% compared with a similar fully active learning approach. A Conditional Random Field (CRF) is chosen as the underlying learning model because of its promising performance in many sequence labeling tasks. In addition, we propose a confidence measure based on the model's variance reduction that achieves considerable accuracy in finding informative samples.
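The token-level routing described in the abstract (uncertain tokens go to a human, the rest are labeled automatically) can be illustrated with a minimal sketch. It is not the paper's method: the abstract's confidence measure is based on variance reduction and is not specified here, so this sketch substitutes the highest per-token marginal probability as a stand-in. The function name `split_by_confidence`, the `threshold` value, and the example marginals are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical input type: one label -> probability mapping per token,
# such as the per-token marginals a CRF can produce.
Marginals = List[Dict[str, float]]


def split_by_confidence(
    marginals: Marginals, threshold: float = 0.9
) -> Tuple[Dict[int, str], List[int]]:
    """Return (auto_labels, ask_human).

    auto_labels maps token index -> most probable label for confident tokens.
    ask_human lists the token indices whose confidence is below the threshold.
    Confidence here is the top marginal probability, an assumed stand-in for
    the variance-based measure mentioned in the abstract.
    """
    auto_labels: Dict[int, str] = {}
    ask_human: List[int] = []
    for i, dist in enumerate(marginals):
        label, prob = max(dist.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            auto_labels[i] = label
        else:
            ask_human.append(i)
    return auto_labels, ask_human


# Made-up marginals for a three-token sentence, for illustration only.
example = [
    {"B-PER": 0.97, "O": 0.03},
    {"B-PER": 0.55, "O": 0.45},
    {"O": 0.99, "B-PER": 0.01},
]
auto, manual = split_by_confidence(example, threshold=0.9)
print(auto)    # {0: 'B-PER', 2: 'O'}
print(manual)  # [1]
```

In this toy run only the second token would be sent for manual annotation; the threshold controls the trade-off between annotation effort and the risk of accepting wrong automatic labels.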
