Conference: International Symposium on String Processing and Information Retrieval

What's Changed? Measuring Document Change in Web Crawling for Search Engines

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes, such as those in images, advertisements, and headers, are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
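
As a rough illustration of the "more than twenty words change" rule mentioned in the abstract, the sketch below compares two versions of a document as bags of words. The abstract does not say how changed words are counted, so the tokenization, the definition of a changed word (words added plus words removed), and the names `words_changed` and `should_recrawl` are all assumptions made for this example, not the authors' method.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits, or underscores, case-folded.
WORD_RE = re.compile(r"\w+")


def words_changed(old_text: str, new_text: str) -> int:
    """Count words added plus words removed between two versions.

    Assumption: documents are compared as bags of words, so word order
    is ignored and a replaced word counts twice (one removal, one addition).
    """
    old_counts = Counter(WORD_RE.findall(old_text.lower()))
    new_counts = Counter(WORD_RE.findall(new_text.lower()))
    removed = old_counts - new_counts  # words only in the old version
    added = new_counts - old_counts    # words only in the new version
    return sum(removed.values()) + sum(added.values())


def should_recrawl(old_text: str, new_text: str, threshold: int = 20) -> bool:
    """Flag a document for refresh when more than `threshold` words changed."""
    return words_changed(old_text, new_text) > threshold
```

Under these assumptions, a page whose only difference is a rotated advertisement of a few words would stay below the threshold, while a rewritten paragraph would exceed it.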
