An Empirical Evaluation of Active Learning and Selective Sampling Variations Supporting Large Corpus Labeling.

Abstract

A constant challenge to researchers is the lack of large and timely datasets of domain examples (research corpora) used for training and testing their latest algorithms. Corpus examples are often annotated with special labels that represent class categories, numeric predictions, etc., depending on the research problem. While acquiring large numbers of examples is often not difficult, ensuring that each is correctly and consistently labeled certainly can be. Human experts may be required to visually inspect, annotate, and cross-check each example to guarantee its accuracy. Unfortunately, the costs incurred in performing this adjudication have led to a shortage of labeled corpora, particularly bigger and more recent ones. The primary goal of our research has been to determine how larger volumes of examples could be autonomously annotated to create more substantial datasets using a minimum of human intervention while maintaining acceptable levels of labeling accuracy.

We chose a form of Machine Learning, Active Learning, as the basis for building a suite of automated corpus labeling tools. Our labelers start with a few pre-labeled examples and a larger number of unlabeled examples. They then iteratively select small batches of these examples for labeling by an "oracle", which may be a live human expert or some other authoritative source. This "selective sampling" step picks those queries which the tools themselves think would enhance their future labeling predictions. Once the labelers have been trained, the learning iterations cease and the rest of the unlabeled examples in a corpus can be confidently labeled without additional human intervention.

To sample the most informative queries we began with the well-known Uncertainty Sampling (US) technique. However, US can be computationally expensive, and so we have proposed a new variant, Approximate Uncertainty Sampling (AUS), that is nearly as effective, but which has lower complexity costs and much less processing overhead. These reductions allow AUS to select queries more frequently and support other types of computation during labeling. In this way AUS encourages the building of larger and more topical corpora for the research communities that require them.
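As a concrete illustration of the pool-based loop the abstract describes, below is a minimal Python sketch of selective sampling with the standard least-confidence Uncertainty Sampling criterion; the dissertation's AUS variant is not specified in this record, so only the US baseline is shown. The synthetic dataset, the logistic-regression labeler, and the batch/round sizes are illustrative assumptions, and the held-out true labels simply stand in for the oracle.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Illustrative stand-ins: a synthetic "corpus" and a simple probabilistic labeler.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    rng = np.random.default_rng(0)

    labeled = list(rng.choice(len(X), size=10, replace=False))   # a few pre-labeled seed examples
    pool = [i for i in range(len(X)) if i not in set(labeled)]   # the unlabeled pool

    model = LogisticRegression(max_iter=1000)
    BATCH, ROUNDS = 20, 10                                       # assumed query-batch size and iteration count

    for _ in range(ROUNDS):
        model.fit(X[labeled], y[labeled])
        proba = model.predict_proba(X[pool])
        uncertainty = 1.0 - proba.max(axis=1)        # least confidence: low top-class probability = informative
        batch = np.argsort(uncertainty)[-BATCH:]     # selective sampling: pick the most uncertain queries
        for q in sorted(batch.tolist(), reverse=True):
            labeled.append(pool.pop(q))              # the "oracle" answer is the held-out true label in y

    # Learning iterations cease; the trained labeler annotates the rest of the corpus on its own.
    model.fit(X[labeled], y[labeled])
    auto_labels = model.predict(X[pool])
    print(f"oracle-labeled: {len(labeled)}, automatically labeled: {len(auto_labels)}")

Per the abstract, AUS trades some of the exactness of this selection step for lower per-query cost, allowing queries to be issued more frequently; its precise scoring rule is not given in this record.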

Bibliographic record

  • Author: Markowitz, Theodore J.
  • Affiliation: Pace University.
  • Degree-granting institution: Pace University.
  • Subject: Computer Science.
  • Degree: D.P.S.
  • Year: 2011
  • Pagination: 182 p.
  • Total pages: 182
  • Format: PDF
  • Language: eng
