Journal: Computer Speech and Language

Semi-supervised ranking for document retrieval

Abstract

Ranking functions are an important component of information retrieval systems. Recently there has been a surge of research in the field of "learning to rank", which aims to use labeled training data and machine learning algorithms to construct reliable ranking functions. Machine learning methods such as neural networks, support vector machines, and least squares have been successfully applied to ranking problems, and some are already deployed in commercial search engines. Despite these successes, most algorithms to date construct ranking functions in a supervised learning setting, which assumes that relevance labels are provided by human annotators before the ranking function is trained. Such methods may perform poorly when human relevance judgments are not available for a wide range of queries. In this paper, we examine whether additional unlabeled data, which is easy to obtain, can be used to improve supervised algorithms. In particular, we investigate the transductive setting, where the unlabeled data is the test data itself. We propose a simple yet flexible transductive meta-algorithm: the key idea is to adapt the training procedure to each test list after observing the documents that need to be ranked. We investigate two instantiations of this general framework. The feature generation approach discovers more salient features from the unlabeled test data and trains a ranker on this test-dependent feature set. The importance weighting approach draws on ideas from the domain adaptation literature and works by re-weighting the training data to match the statistics of each test list. We demonstrate that both approaches improve over supervised algorithms on the TREC and OHSUMED tasks from the LETOR dataset.
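
To make the importance weighting idea concrete, here is a minimal sketch of the per-test-list loop. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which weight estimator or base ranker is used, so this sketch assumes a diagonal-Gaussian density ratio for the weights and a weighted ridge least-squares scorer as the ranker. All function names (`importance_weights`, `fit_weighted_least_squares`, `transductive_rank`) are hypothetical.

```python
import numpy as np

def gaussian_log_density(X, mean, var):
    """Log density of each row of X under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)

def importance_weights(X_train, X_test, eps=1e-6):
    """Weight each training document by an estimated p_test(x) / p_train(x).

    Assumption: both densities are modelled as diagonal Gaussians, which is a
    simplification chosen for this sketch only.
    """
    mu_tr, var_tr = X_train.mean(axis=0), X_train.var(axis=0) + eps
    mu_te, var_te = X_test.mean(axis=0), X_test.var(axis=0) + eps
    log_ratio = (gaussian_log_density(X_train, mu_te, var_te)
                 - gaussian_log_density(X_train, mu_tr, var_tr))
    w = np.exp(log_ratio - log_ratio.max())  # subtract max for numerical stability
    return w / w.mean()                      # normalise so the weights average to 1

def fit_weighted_least_squares(X, y, w, ridge=1e-3):
    """Weighted ridge regression on relevance labels, with a bias column."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ (Xb * w[:, None]) + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (w * y))

def transductive_rank(X_train, y_train, test_lists):
    """Retrain one ranker per test list after observing its unlabeled documents."""
    rankings = []
    for X_test in test_lists:
        w = importance_weights(X_train, X_test)
        beta = fit_weighted_least_squares(X_train, y_train, w)
        Xb = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
        rankings.append(np.argsort(-(Xb @ beta)))  # best-scoring document first
    return rankings
```

Each element of `test_lists` is a feature matrix holding the documents retrieved for one query; the function returns, for each list, the document indices ordered from highest to lowest predicted relevance. Training documents that resemble the current test list receive larger weights, so the ranker is fitted mainly on the part of the training data that matches that list.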
