Journal: ACM Transactions on Database Systems

Effective Page Refresh Policies for Web Crawlers

Abstract

In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem faced by Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, the remote data sources (Web sites) do not notify the copies (Web crawlers) of new changes, so the crawler must poll the sources periodically to keep the copies up to date. Because polling the sources takes significant time and resources, it is very difficult to keep the copies completely up to date. This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of the data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model for the changes of Web pages, and that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we obtained orders-of-magnitude improvements over existing policies.
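To make the Poisson change model concrete, the following Python sketch estimates how fresh a single copy stays when it is re-fetched at a fixed interval. It rests on assumptions that the abstract does not state: freshness is taken to be 1 while the copy matches the source and 0 otherwise, the page changes as a Poisson process with a known rate, and refreshes are instantaneous. The function names `expected_freshness` and `simulated_freshness` are hypothetical and are not taken from the article.

```python
import math
import random


def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Time-averaged probability that the copy is up to date.

    Assumption: under a Poisson change model with rate `change_rate`, the
    copy is still unchanged t time units after a refresh with probability
    exp(-change_rate * t). Averaging that over one refresh interval gives
    (1 - exp(-x)) / x with x = change_rate * refresh_interval.
    """
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return (1.0 - math.exp(-x)) / x


def simulated_freshness(change_rate: float, refresh_interval: float,
                        cycles: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (illustrative only)."""
    rng = random.Random(seed)
    fresh_time = 0.0
    for _ in range(cycles):
        # Time from a refresh to the first change is exponentially distributed.
        first_change = rng.expovariate(change_rate)
        # The copy is fresh until the first change or the next refresh.
        fresh_time += min(first_change, refresh_interval)
    return fresh_time / (cycles * refresh_interval)
```

With one expected change per refresh interval, `expected_freshness(1.0, 1.0)` evaluates to 1 - 1/e, about 0.63, and the simulation should land close to that value. Under these assumptions, refreshing a fast-changing page more often yields diminishing returns, which is one way to see why the choice of refresh policy matters.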
