Active acquisition of informative training data.

Abstract

The performance of a classifier built from labeled training data depends heavily on the quality of that data. In many domains, collecting high-quality training data is labor-intensive and expensive. To address this problem, we must ensure that the examples acquired are informative. Ideally, one would gather a training set containing only relevant, non-redundant examples, and would acquire it efficiently, with minimal effort and resources. The time of the human who assists in data generation is precious, and we seek to use it wisely. By considering class proportions, this thesis makes three contributions to optimizing the use of human assistance in creating training data for computer-based classifiers. First, we identify a new class of supervised learning problems in which the process of generating data cannot be separated from the process of obtaining labels. This class of problems, which we call Active Class Selection (ACS), addresses the question: if one can collect n additional training instances, how should they be distributed with respect to class? The second and third contributions improve training data collection for a previously identified problem, Active Learning (AL). AL addresses a question distinct from, but related to, ACS: if one has n instances in an unlabeled pool U, which instances from U should one have a human label? We offer two methods of solving this problem. First, we demonstrate how ideas from ACS can be used to perform AL on multiclass datasets. Second, we address a largely neglected problem in AL: when should one stop labeling data because further labels will no longer improve classifier performance? We also explore how to dynamically choose which AL method is best suited to a dataset at a given stage of AL.
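The two questions posed in the abstract can be made concrete with a minimal Python sketch. It is an illustration only: the abstract does not say how the thesis distributes instances across classes or how it selects instances from U, so both functions are assumptions. `allocate_by_class` splits a budget of n new instances in proportion to a hypothetical per-class score (for example, a per-class error estimate), and `least_confident` applies least-confidence sampling, a standard pool-based AL heuristic that is not necessarily the one used in the thesis.

```python
import math


def allocate_by_class(scores, n):
    """ACS-style question: how should n new instances be split across classes?

    `scores` maps each class to a non-negative weight. The weighting scheme is
    a hypothetical stand-in, not the method from the thesis.
    """
    total = sum(scores.values())
    if total <= 0:
        # No usable signal: fall back to a uniform split.
        scores = {c: 1.0 for c in scores}
        total = float(len(scores))
    raw = {c: n * s / total for c, s in scores.items()}
    counts = {c: math.floor(r) for c, r in raw.items()}
    leftover = n - sum(counts.values())
    # Largest-remainder rounding so the counts sum to exactly n.
    by_fraction = sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_fraction[:leftover]:
        counts[c] += 1
    return counts


def least_confident(pool_probs, k):
    """AL-style question: which k instances in the unlabeled pool U get labels?

    `pool_probs[i]` holds the current classifier's predicted class
    probabilities for pool instance i. Instances whose top probability is
    lowest are the ones the classifier is least sure about.
    """
    ranked = sorted(range(len(pool_probs)), key=lambda i: max(pool_probs[i]))
    return ranked[:k]


if __name__ == "__main__":
    print(allocate_by_class({"a": 3, "b": 1, "c": 1}, 10))
    print(least_confident([[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]], 2))
```

In this sketch the ACS function decides only how many instances to request per class, since in ACS generating an instance and obtaining its label are one step. The AL function assumes the instances already exist and chooses which of them to send to a human labeler.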

Bibliographic record

  • Author

    Lomasky, Rachel

  • Author affiliation

    Tufts University

  • Degree-granting institution: Tufts University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 167 p.
  • Total pages: 167
  • Original format: PDF
  • Language: English
