IEEE Journal on Selected Areas in Communications

Optimal Web Page Download Scheduling Policies for Green Web Crawling

Abstract

A web crawler is responsible for discovering and downloading new pages on the Web as well as refreshing previously downloaded pages. During these operations, the crawler issues a large number of HTTP requests to web servers. These requests increase the energy consumption and carbon footprint of the web servers since computational resources are used while serving the requests. In this work, we introduce the problem of green web crawling, where the objective is to devise a page refresh policy that minimizes the total staleness of pages in the repository of a web crawler, subject to a constraint on the amount of carbon emissions due to the processing on web servers. For the case of one web server and one crawling thread, the optimal policy turns out to be a greedy one. At each iteration, the page to be refreshed is selected based on a metric that considers the page’s staleness, its size, and the greenness of the energy consumed at the web server premises. We then extend the optimal policy to the cases of 1) many servers; 2) multiple threads; and 3) pages with variable freshness requirements. We conduct simulations on a real data set that involves a large web server collection hosting around two billion pages. We present experimental results for the optimal page refresh policy as well as for various heuristics, in an effort to study the effect of different factors on performance.
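
The abstract names the inputs to the greedy selection metric (staleness, page size, and the greenness of the server's energy) but does not give its formula. The sketch below is only an illustration of that kind of policy, not the paper's method. It assumes a hypothetical score equal to staleness divided by estimated emissions, where emissions are modeled as `size_bytes * carbon_per_byte`. The `Page` fields, `emission_cost`, and `greedy_refresh` are all names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    staleness: float        # how out of date the stored copy is (any consistent unit)
    size_bytes: int         # bytes the server must send to refresh the page
    carbon_per_byte: float  # hypothetical emissions factor of the hosting server


def emission_cost(page):
    """Assumed emissions model: proportional to bytes served (illustrative only)."""
    return page.size_bytes * page.carbon_per_byte


def greedy_refresh(pages, carbon_budget):
    """Select pages in decreasing staleness-per-emission order within the budget.

    Assumes every page has a strictly positive emission cost.
    """
    ranked = sorted(
        pages,
        key=lambda p: p.staleness / emission_cost(p),
        reverse=True,
    )
    chosen, spent = [], 0.0
    for page in ranked:
        cost = emission_cost(page)
        if spent + cost <= carbon_budget:  # skip pages that would exceed the budget
            chosen.append(page)
            spent += cost
    return chosen, spent


if __name__ == "__main__":
    pages = [
        Page("a", staleness=10.0, size_bytes=100, carbon_per_byte=1.0),
        Page("b", staleness=5.0, size_bytes=10, carbon_per_byte=1.0),
        Page("c", staleness=8.0, size_bytes=20, carbon_per_byte=0.5),
    ]
    chosen, spent = greedy_refresh(pages, carbon_budget=25.0)
    print([p.url for p in chosen], spent)
```

Under these assumptions, a small stale page on a low-carbon server is refreshed before a large page on a carbon-intensive one. The ratio form mirrors the usual greedy heuristic for budget-constrained selection; the actual metric and its optimality proof are in the paper itself.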
