Crawling the Web: Discovery and maintenance of large-scale Web data.

Abstract

This dissertation studies the challenges and issues faced in implementing an effective Web crawler. A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putting too much pressure on the visited Web sites and the crawler's local network, because they are intrinsically shared resources.

This dissertation studies how we can build an effective Web crawler that can retrieve “high quality” pages quickly while keeping the retrieved pages “fresh.” Towards that goal, we first identify popular definitions for the “importance” of pages and propose simple algorithms that can identify important pages at the early stage of a crawl. We then explore how we can parallelize a crawling process to maximize the download rate while minimizing the overhead from parallelization. Finally, we experimentally study how Web pages change over time and propose an optimal page refresh policy that maximizes the “freshness” of the retrieved pages.

This work has been a part of the WebBase project at Stanford University. The WebBase system currently maintains 130 million pages downloaded from the Web, and these pages are being used actively by many researchers within and outside of Stanford. The crawler for the WebBase project is a direct result of this dissertation research.

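Bibliographic record

  • Author: Cho, Junghoo
  • Author affiliation: Stanford University
  • Degree-granting institution: Stanford University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2002
  • Pagination: 188 p.
  • Total pages: 188
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology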
