Journal: Information Retrieval

Server selection methods in personal metasearch: a comparative empirical study




Server selection is an important subproblem in distributed information retrieval (DIR) but has commonly been studied with collections of more or less uniform size and with more or less homogeneous content. In contrast, realistic DIR applications may feature much more varied collections. In particular, personal metasearch, a novel application of DIR which includes all of a user's online resources, may involve collections which vary in size by several orders of magnitude, and which have highly varied data. We describe a number of algorithms for server selection, and consider their effectiveness when collections vary widely in size and are represented by imperfect samples. We compare the algorithms on a personal metasearch testbed comprising calendar, email, mailing list and web collections, where collection sizes differ by three orders of magnitude. We then explore the effect of collection size variations using four partitionings of the TREC ad hoc data used in many other DIR experiments. Kullback-Leibler divergence, previously considered poorly effective, performs better than expected in this application; other techniques thought to be effective perform poorly and are not appropriate for this problem. A strong correlation with size-based rankings for many techniques may be responsible.
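
As a rough illustration of the kind of Kullback-Leibler server selection the abstract refers to, the sketch below ranks collections by the divergence between a query's term distribution and a smoothed language model built from each collection's sampled documents. It is not the formulation evaluated in the paper: the function names (`term_counts`, `kl_rank`), the whitespace tokenization, the linear-interpolation smoothing and the value of `lam` are all assumptions made for the example.

```python
import math
from collections import Counter


def term_counts(docs):
    """Count lowercase whitespace-separated terms over a list of sampled documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def kl_rank(query, samples, lam=0.5):
    """Rank collection names by KL(query || collection), smallest divergence first.

    `samples` maps a collection name to a list of sampled document strings.
    `lam` is an assumed smoothing weight on the pooled background model.
    """
    per_coll = {name: term_counts(docs) for name, docs in samples.items()}

    # Background model: all samples pooled together (an assumption of this sketch).
    background = Counter()
    for counts in per_coll.values():
        background.update(counts)
    bg_total = sum(background.values())

    q_counts = Counter(query.lower().split())
    q_total = sum(q_counts.values())

    scores = {}
    for name, counts in per_coll.items():
        total = sum(counts.values())
        kl = 0.0
        for term, qtf in q_counts.items():
            if background[term] == 0:
                # Term unseen in every sample: it cannot discriminate, so skip it.
                continue
            p_q = qtf / q_total
            p_local = counts[term] / total if total else 0.0
            p_c = (1 - lam) * p_local + lam * background[term] / bg_total
            kl += p_q * math.log(p_q / p_c)
        scores[name] = kl
    return sorted(scores, key=scores.get)


# Toy samples standing in for a personal metasearch testbed (hypothetical data).
samples = {
    "calendar": ["meeting with alice monday", "dentist appointment tuesday"],
    "email": ["project report attached", "re project deadline moved"],
    "web": ["python tutorial for beginners", "news weather sports"],
}
ranking = kl_rank("project deadline", samples)
```

Because each collection's model is normalized by its own sample length, a score of this form depends on term proportions rather than on raw collection size. The abstract suggests that correlation with size-based rankings may explain how the compared techniques behave, so this property is relevant, but the sketch is not evidence for the paper's results.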


