Conference: String Processing and Information Retrieval

What's Changed? Measuring Document Change in Web Crawling for Search Engines

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes (such as in images, advertisements, and headers) are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
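As a rough illustration of the "more than twenty words change" rule, the sketch below compares two versions of a document's text and flags it for recrawling when the word-level difference exceeds a threshold. The abstract does not specify how changed words are counted, so everything here is an assumption made for illustration: the tokenizer, the counting rule (word occurrences added plus word occurrences removed), and the names `word_counts`, `changed_word_count`, and `should_recrawl` are all hypothetical and not taken from the paper.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of ASCII letters and digits.
WORD_RE = re.compile(r"[a-z0-9]+")


def word_counts(text: str) -> Counter:
    """Return a bag of words (word -> occurrence count) for the text."""
    return Counter(WORD_RE.findall(text.lower()))


def changed_word_count(old_text: str, new_text: str) -> int:
    """Count word occurrences added plus word occurrences removed.

    This counting rule is an assumption; the abstract only says
    "more than twenty words change".
    """
    old, new = word_counts(old_text), word_counts(new_text)
    removed = old - new  # Counter subtraction keeps positive counts only
    added = new - old
    return sum(removed.values()) + sum(added.values())


def should_recrawl(old_text: str, new_text: str, threshold: int = 20) -> bool:
    """Flag the document when more than `threshold` words have changed."""
    return changed_word_count(old_text, new_text) > threshold


if __name__ == "__main__":
    old = "breaking news from the city council meeting"
    new = "breaking news from the town council meeting"
    print(changed_word_count(old, new), should_recrawl(old, new))
```

Because the comparison uses only extracted text, a change confined to images or markup would leave the count at zero under this sketch, which matches the abstract's point that such changes are unlikely to affect query results.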