Conference: ACM Conference on Information and Knowledge Management

CiteData: A New Multi-Faceted Dataset for Evaluating Personalized Search Performance

Abstract

Personalized search systems have evolved to utilize heterogeneous features, including document hyperlinks, category labels in various taxonomies, and social tags, in addition to the free text of the documents. Consequently, classifiers, PageRank algorithms, and collaborative filtering methods are often used as intermediate steps in such personalized retrieval systems. Thorough comparative evaluation of such complex systems has been difficult due to the lack of appropriate publicly available datasets that provide such diverse feature sets. To remedy the situation, we have created CiteData, a new dataset for benchmark evaluations of personalized search performance that will be made publicly accessible. CiteData is a collection of academic articles extracted from the CiteULike and CiteSeer repositories, with rich feature sets such as authors, author affiliations, topic labels, social tags, and citation information. We further supplement it with personalized queries and relevance judgments obtained from volunteer users. This paper starts with a discussion of the design criteria and characteristics of the CiteData dataset in comparison with current benchmark datasets, followed by a set of task-oriented empirical evaluations of popular algorithms in statistical classification, collaborative filtering, and link analysis as intermediate steps for personalized search. Our results show a significant performance improvement of personalized approaches over unpersonalized approaches. We also observe that a meta personalized search engine that leverages information from multiple sources of features performs better than algorithms that use only one of the constituent sources of features.