European conference on information retrieval research

A Corpus of Realistic Known-Item Topics with Associated Web Pages in the ClueWeb09


Abstract

Known-item finding is the task of finding a previously seen item. Such items range from visited websites and received emails to books read or movies seen. Most research on known-item finding focuses on web or email retrieval and is conducted on proprietary corpora that are not publicly available. Public corpora are usually rather artificial, as they contain automatically generated known-item queries or queries formulated by humans who are actually looking at the known item. In this paper, we study original known-item information needs mined from questions at the popular Yahoo! Answers Q&A service. By carefully sampling only questions with a related known-item web page in the ClueWeb09 corpus, we provide an environment for repeatable, realistic studies of known-item information needs and of how a retrieval system could react. In particular, our own study sheds some first light on false memories within the known-item questions articulated by the users. Our main finding is that false memories often relate to mixed-up names. This indicates that a search engine that retrieves no results for a known-item query could try to avoid returning a zero-result list by ignoring or replacing names in such queries. Our publicly available corpus of 2,755 known-item questions mapped to web pages in the ClueWeb09 includes 240 questions with annotated and corrected false memories.
