SIAM International Conference on Data Mining

Learning from Heterogeneous Sources via Gradient Boosting Consensus


Abstract

Multiple data sources containing different types of features may be available for a given task. For instance, users' profiles can be used to build recommendation systems. In addition, a model can also use users' historical behaviors and social networks to infer users' interests on related products. We argue that it is desirable to collectively use any available multiple heterogeneous data sources in order to build effective learning models. We call this framework heterogeneous learning. In our proposed setting, data sources can include (i) non-overlapping features, (ii) non-overlapping instances, and (iii) multiple networks (i.e., graphs) that connect instances. In this paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding learning model from gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) there should be consensus among the predictions of overlapping instances (if any) from different data sources; (2) connected instances in graph datasets may have similar predictions. The objective function is solved by stochastic gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources and deemphasize noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number recognition and terrorist attack detection tasks. We observe that the proposed model can reduce out-of-sample error rate by as much as 80%.
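The objective sketched in the abstract (per-source empirical loss, plus a consensus penalty on overlapping instances and a graph-smoothness penalty) can be illustrated roughly as follows. This is a toy sketch, not the authors' implementation: the regression-stump base learner, the parameter values, the synthetic data, and the simplified graph term (a pairwise squared difference on the averaged prediction) are all assumptions made for illustration.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares regression stump on residuals r: pick the best
    (feature, threshold, left-mean, right-mean)."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # drop max so both sides stay nonempty
            left = X[:, j] <= t
            lm, rm = r[left].mean(), r[~left].mean()
            err = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
            if err < best_err:
                best, best_err = (j, t, lm, rm), err
    return best

def predict_stump(stump, X):
    j, t, lm, rm = stump
    return np.where(X[:, j] <= t, lm, rm)

def gbc_fit(views, y, graph_edges, rounds=50, lr=0.1, lam_c=1.0, lam_g=0.1):
    """Gradient-boosting-consensus sketch: each source keeps its own additive
    model; gradients couple the sources through the averaged prediction."""
    n, S = len(y), len(views)
    F = [np.zeros(n) for _ in views]          # per-source predictions
    models = [[] for _ in views]
    for _ in range(rounds):
        fbar = np.mean(F, axis=0)             # consensus prediction
        for s, Xs in enumerate(views):
            # gradient of: squared loss + consensus + graph-smoothness terms
            g = (F[s] - y) + lam_c * (F[s] - fbar)
            for i, j in graph_edges:          # connected instances pulled together
                g[i] += lam_g * (fbar[i] - fbar[j]) / S
                g[j] += lam_g * (fbar[j] - fbar[i]) / S
            stump = fit_stump(Xs, -g)         # fit base learner to neg. gradient
            F[s] += lr * predict_stump(stump, Xs)
            models[s].append(stump)
    return models, np.mean(F, axis=0)

# Toy data: two feature views, each explaining part of the target,
# plus a sparse graph over the instances.
rng = np.random.default_rng(0)
n = 60
X1 = rng.normal(size=(n, 2))
X2 = rng.normal(size=(n, 2))
y = X1[:, 0] + X2[:, 1]
edges = [(i, i + 1) for i in range(0, n - 1, 2)]

models, fbar = gbc_fit([X1, X2], y, edges)
mse = ((fbar - y) ** 2).mean()
```

Note that neither view alone can explain `y`; it is the averaged (consensus) prediction that combines what each source's boosted model recovers, which is the intuition behind penalizing disagreement between sources.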
