Journal: SIGKDD Explorations

Multi-Domain Active Learning for Text Classification

Abstract

Active learning has proven effective at reducing labeling effort for supervised learning. However, existing active learning work has mainly focused on training models for a single domain. In practical applications, it is common to train classifiers for multiple domains simultaneously. For example, some merchant websites (like Amazon.com) may need a set of classifiers to predict the sentiment polarity of product reviews collected from various domains (e.g., electronics, books, shoes). Although different domains have their own unique features, they may share some common latent features. If we apply active learning to each domain separately, some data instances selected from different domains may carry duplicate knowledge because of these common features. Therefore, how to choose which data from multiple domains to label is crucial to further reducing human labeling effort in multi-domain learning. In this paper, we propose a novel multi-domain active learning framework that jointly selects data instances from all domains while accounting for duplicate information. In our solution, a shared subspace is first learned to represent the common latent features of the different domains. By considering the common and the domain-specific features together, the model loss reduction induced by each data instance can be decomposed into a common part and a domain-specific part. In this way, the duplicate information across domains is encoded in the common part of the model loss reduction and taken into account when querying. We compare our method with state-of-the-art active learning approaches on several text classification tasks: sentiment classification, newsgroup classification, and email spam filtering. The experimental results show that our method reduces human labeling effort by 33.2%, 42.9%, and 68.7% on the three tasks, respectively.
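
The idea of scoring each candidate by a common part plus a domain-specific part can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not give the loss-reduction formula or the subspace learning procedure, so the code below assumes a shared subspace is already available as a matrix `P` and substitutes a simple geometric novelty score (how much of an instance is not yet covered by earlier selections) for the model loss reduction. The names `residual`, `greedy_joint_select`, `P`, and the example data are all hypothetical.

```python
import numpy as np

def residual(v, basis):
    """Return the part of v orthogonal to every (orthonormal) vector in basis."""
    r = np.asarray(v, dtype=float).copy()
    for b in basis:
        r -= (r @ b) * b
    return r

def greedy_joint_select(X, domains, P, budget, tol=1e-9):
    """Greedily pick `budget` rows of X, pooling candidates from all domains.

    Toy score (an assumption, standing in for model loss reduction):
      common part   = novelty inside the shared subspace, measured against
                      selections from ALL domains;
      specific part = novelty outside it, measured only against selections
                      from the instance's own domain.
    """
    Q, _ = np.linalg.qr(P)        # orthonormal basis of the shared subspace
    common = X @ Q @ Q.T          # component inside the shared subspace
    specific = X - common         # domain-specific remainder
    common_basis = []             # shared across domains
    specific_basis = {d: [] for d in set(domains)}
    selected = []
    for _ in range(budget):
        best, best_score = None, -1.0
        for i in range(len(X)):
            if i in selected:
                continue
            score = (np.linalg.norm(residual(common[i], common_basis))
                     + np.linalg.norm(residual(specific[i], specific_basis[domains[i]])))
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        selected.append(best)
        # Record what the chosen instance covers so later duplicates score lower.
        for vec, basis in ((common[best], common_basis),
                           (specific[best], specific_basis[domains[best]])):
            r = residual(vec, basis)
            n = np.linalg.norm(r)
            if n > tol:
                basis.append(r / n)
    return selected

# Hypothetical data: the first axis is the shared subspace.
P = np.array([[1.0], [0.0], [0.0]])
X = np.array([[2.0, 0.1, 0.0],    # domain 0
              [2.0, 0.0, 0.1],    # domain 1, same common content as row 0
              [0.0, 0.0, 1.5]])   # domain 1, mostly domain-specific
domains = [0, 1, 1]
selected = greedy_joint_select(X, domains, P, budget=2)
print(selected)
```

In this example, once an instance carrying the shared content has been chosen from one domain, the near-duplicate in the other domain loses its common-part score, so the second query goes to the instance with new domain-specific content. Selecting independently per domain would have spent a label on each of the two overlapping instances.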
