Source journal: Brazilian Computer Society. Journal

D-Confidence: an active learning strategy to reduce label disclosure complexity in the presence of imbalanced class distributions

Abstract

In some classification tasks, such as those related to the automatic building and maintenance of text corpora, it is expensive to obtain labeled instances to train a classifier. In such circumstances it is common to have massive corpora where a few instances are labeled (typically a minority) while the others are not. Semi-supervised learning techniques try to leverage the intrinsic information in unlabeled instances to improve classification models. However, these techniques assume that the labeled instances cover all the classes to be learned, which might not be the case. Moreover, in the presence of an imbalanced class distribution, obtaining labeled instances from minority classes can be very costly and require extensive labeling if queries are selected at random. Active learning allows asking an oracle to label new instances, which are selected according to given criteria, with the aim of reducing the labeling effort. D-Confidence is an active learning approach that is effective in the presence of imbalanced training sets. In this paper we evaluate the performance of d-Confidence in comparison to its baseline criteria over tabular and text datasets. We provide empirical evidence that d-Confidence reduces label disclosure complexity (which we have defined as the number of queries required to identify instances from all classes to be learned) in the presence of imbalanced data.
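
The definition of label disclosure complexity given in the abstract can be illustrated with a minimal sketch. The code below is not the d-Confidence criterion itself, which the abstract does not specify; it only counts how many queries a given query order needs before every class has been seen, and applies that count to the random-selection case the abstract describes as costly. The function names, the class names, and the 95/4/1 class split are hypothetical choices made for illustration.

```python
import random


def label_disclosure_complexity(query_order, labels):
    """Count the queries needed until every class in `labels` has been seen.

    `query_order` is the sequence of instance indices sent to the oracle.
    Returns None if the sequence ends before all classes are disclosed.
    """
    all_classes = set(labels)
    seen = set()
    for n_queries, idx in enumerate(query_order, start=1):
        seen.add(labels[idx])  # the oracle reveals the label of instance idx
        if seen == all_classes:
            return n_queries
    return None


def random_query_order(n_instances, seed):
    """Random selection: a seeded shuffle of all instance indices."""
    rng = random.Random(seed)
    order = list(range(n_instances))
    rng.shuffle(order)
    return order


# Hypothetical imbalanced pool (assumed for illustration only):
# 95 majority instances and two minority classes with 4 and 1 instances.
labels = ["majority"] * 95 + ["minor_a"] * 4 + ["minor_b"] * 1

# Repeat random selection over several seeds and average the query counts.
trials = [
    label_disclosure_complexity(random_query_order(len(labels), seed), labels)
    for seed in range(200)
]
mean_complexity = sum(trials) / len(trials)
print(f"mean label disclosure complexity under random selection: {mean_complexity:.1f}")
```

Under these assumptions, the rarest class dominates the count, because the single `minor_b` instance can fall anywhere in the random order. An active learning criterion would replace `random_query_order` with its own selection rule and be scored by the same function.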