Conference: International Conference on World Wide Web

Multi-Label Learning with Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages


Abstract

Recommending phrases from web pages for advertisers to bid on against search engine queries is an important research problem with direct commercial impact. Most approaches have found it infeasible to determine the relevance of all possible queries to a given ad landing page and have focused on making recommendations from a small set of phrases extracted (and expanded) from the page using NLP- and ranking-based techniques. In this paper, we eschew this paradigm and demonstrate that it is possible to efficiently predict the relevant subset of queries from a large set of monetizable ones by posing the problem as a multi-label learning task, with each query represented by a separate label. We develop Multi-label Random Forests to tackle problems with millions of labels. Our proposed classifier has prediction costs that are logarithmic in the number of labels and can make predictions in a few milliseconds using 10 GB of RAM. We demonstrate that it is possible to generate training data for our classifier automatically from click logs without any human annotation or intervention. We train our classifier on tens of millions of labels, features and training points in less than two days on a thousand-node cluster. We develop a sparse semi-supervised multi-label learning formulation to deal with training set biases and noisy labels harvested automatically from the click logs. This formulation is used to infer a belief in the state of each label for each training ad, and the random forest classifier is extended to train on these beliefs rather than the given labels. Experiments reveal significant gains over ranking- and NLP-based techniques on a large test set of 5 million ads using multiple metrics.
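The prediction-cost claim in the abstract can be illustrated with a toy sketch. This is not the authors' Multi-label Random Forest; it is a minimal example, assuming roughly balanced trees whose leaves store sparse label distributions, of why prediction time can depend on tree depth rather than on the total number of labels. All names here (`Node`, `predict_tree`, `predict_forest`, and the example phrases) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    # Internal node: route on whether a sparse feature (e.g. a word) is present.
    feature: Optional[str] = None
    left: Optional["Node"] = None    # taken when the feature is absent
    right: Optional["Node"] = None   # taken when the feature is present
    # Leaf node: sparse distribution over the few labels seen at this leaf.
    label_probs: Dict[str, float] = field(default_factory=dict)


def predict_tree(node: Node, features: Set[str]) -> Dict[str, float]:
    """Walk from root to leaf; cost is the tree depth, not the label count."""
    while node.feature is not None:
        node = node.right if node.feature in features else node.left
    return node.label_probs


def predict_forest(trees: List[Node], features: Set[str], k: int) -> List[str]:
    """Average the leaf distributions over all trees and return the top-k labels."""
    totals: Counter = Counter()
    for tree in trees:
        for label, prob in predict_tree(tree, features).items():
            totals[label] += prob / len(trees)
    return [label for label, _ in totals.most_common(k)]


# Hypothetical two-tree forest over made-up bid phrases.
tree_a = Node(
    feature="shoes",
    left=Node(label_probs={"cheap flights": 1.0}),
    right=Node(label_probs={"buy shoes": 0.6, "running shoes": 0.4}),
)
tree_b = Node(
    feature="running",
    left=Node(label_probs={"buy shoes": 0.5, "cheap flights": 0.5}),
    right=Node(label_probs={"running shoes": 0.9, "marathon training": 0.1}),
)

top_phrases = predict_forest([tree_a, tree_b], {"shoes", "running"}, k=2)
print(top_phrases)  # ['running shoes', 'buy shoes']
```

Only the labels stored at the reached leaves are ever touched, so the work per prediction stays small even when the full label set is very large. How the paper's trees are actually grown and how they train on inferred beliefs is not shown here.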
