
Active learning with support vector machines for imbalanced datasets and a method for stopping active learning based on stabilizing predictions.

Abstract

When developing systems with machine learning methods, annotating training data is one of the major expenses and has become a bottleneck in the development of new systems. Accordingly, the use of active learning (AL) to reduce annotation costs has recently generated considerable interest. The intuition behind AL is that giving the learner the ability to control which data is labeled will yield higher-performing models with less annotation effort.

Support Vector Machines (SVMs) are a method for learning linear classifiers that has worked well for many applications since its introduction and is now in widespread use. Accordingly, AL with SVMs (AL-SVM) is an important area to investigate. In addition to interest in AL-SVM, there has also been considerable interest in dealing effectively with the class imbalance that exists in many applications of machine learning. This dissertation considers class imbalance in binary classification, where the target examples are a small proportion of the total number of examples. It is known from the passive learning literature that SVMs are susceptible to underperforming when there is class imbalance and that methods for addressing class imbalance can significantly improve performance.

However, addressing imbalance during AL has received relatively little attention. One part of this dissertation explores how to effectively address class imbalance during AL-SVM. A main theme is that the process of AL creates training data with markedly different characteristics from training data created through passive annotation, and that modifying the base inference procedure used during AL to take these characteristics into account can lead to more successful active learning. This theme is explored in detail for the important case of AL-SVM for imbalanced datasets. Experimental results across a range of Information Extraction, Text Classification, and Named Entity Classification tasks show that the new methods, which take into account the skewed data created by the AL scenario, outperform methods that do not.

To realize the performance gains enabled by AL, an effective method for stopping the process is critical. Stopping too early results in a lower-performing model, and stopping too late wastes annotation effort. The second part of this dissertation presents a new stopping method based on detecting model stabilization in terms of predictions on data that does not have to be labeled. The principles behind this new method are explained, and experimental results across a range of Information Extraction, Text Classification, and Named Entity Classification tasks show that the new method outperforms previous methods, filling a need for a more aggressive stopping method and providing users with more control over the behavior of automatic stopping of active learning.
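The stopping idea described in the abstract, checking whether successive models have stabilized in their predictions on data that needs no labels, might be sketched roughly as follows. This is an illustration under stated assumptions, not the dissertation's actual procedure: the abstract does not say how agreement is measured, so Cohen's kappa is used here as one plausible choice, and the names `agreement`, `should_stop`, `threshold`, and `window` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def agreement(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Cohen's kappa between two models' predictions on the same examples.

    Assumption: kappa is used as the agreement measure; the abstract only
    speaks of "stabilizing predictions" without naming a metric.
    """
    if len(prev) != len(curr) or not prev:
        raise ValueError("prediction lists must be non-empty and equal length")
    n = len(prev)
    observed = sum(p == c for p, c in zip(prev, curr)) / n
    prev_counts, curr_counts = Counter(prev), Counter(curr)
    # Chance agreement from the two models' marginal label frequencies.
    expected = sum(prev_counts[k] * curr_counts[k] for k in prev_counts) / (n * n)
    if expected == 1.0:
        # Both models assign the same single label to every example.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def should_stop(history: List[Sequence[int]],
                threshold: float = 0.99,
                window: int = 3) -> bool:
    """Decide whether to stop active learning.

    `history` holds one prediction list per AL round, all made on the same
    unlabeled "stop set". Stop once the last `window` consecutive pairs of
    models all agree at or above `threshold` (both hypothetical parameters).
    """
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(agreement(a, b) >= threshold for a, b in zip(recent, recent[1:]))
```

In this sketch, lowering `threshold` or `window` makes stopping more aggressive and raising them makes it more conservative, which is one way a user could be given control over stopping behavior.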
