
Discovering URLs through User Feedback




Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. Because the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss some pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers but not discovered by the Yahoo! Web crawler. We show that a high fraction of the URLs that appear in toolbar logs are not discovered by the crawler. We also show that a certain fraction of URLs are discovered by the crawler only after they are first accessed by users. One important conclusion of our work is that Web search engines can benefit substantially from user feedback, in the form of toolbar logs, for passive URL discovery.
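The comparison described in the abstract can be illustrated with a minimal sketch. It assumes two hypothetical inputs that are not specified in the source: a mapping from each URL to the time it first appears in toolbar logs, and a mapping from each URL to the time the crawler first discovered it. The function name, field names, and data layout are illustrative assumptions, not the authors' method.

```python
from datetime import datetime, timedelta


def passive_discovery_stats(toolbar_first_seen, crawler_first_seen):
    """Compare toolbar-observed URLs with crawler-discovered URLs.

    Both arguments are hypothetical dicts mapping URL -> datetime of the
    first observation in that source.
    """
    # URLs users visited that the crawler never discovered.
    missed = set(toolbar_first_seen) - set(crawler_first_seen)

    # URLs the crawler discovered only after users had already visited them,
    # mapped to the length of the delay.
    late = {
        url: crawler_first_seen[url] - toolbar_first_seen[url]
        for url in toolbar_first_seen.keys() & crawler_first_seen.keys()
        if crawler_first_seen[url] > toolbar_first_seen[url]
    }

    total = len(toolbar_first_seen)
    return {
        "missed": missed,
        "late": late,
        "missed_fraction": len(missed) / total if total else 0.0,
        "late_fraction": len(late) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    toolbar = {
        "http://a.example/": datetime(2010, 1, 1),
        "http://b.example/": datetime(2010, 1, 2),
        "http://c.example/": datetime(2010, 1, 3),
        "http://d.example/": datetime(2010, 1, 4),
    }
    crawler = {
        "http://b.example/": datetime(2010, 1, 5),
        "http://c.example/": datetime(2010, 1, 1),
        "http://e.example/": datetime(2010, 1, 1),
    }
    stats = passive_discovery_stats(toolbar, crawler)
    print(sorted(stats["missed"]))
    print(stats["late"])
```

Under these assumptions, the `missed` set corresponds to URLs that appear in toolbar logs but were never crawled, and `late` corresponds to URLs the crawler found only after users first accessed them.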




