Crawling the Web: Discovery and maintenance of large-scale Web data.

Abstract

This dissertation studies the challenges of implementing an effective Web crawler. A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putting too much pressure on the visited Web sites and on its own local network, because they are intrinsically shared resources.

This dissertation studies how we can build an effective Web crawler that retrieves "high quality" pages quickly while keeping the retrieved pages "fresh." Toward that goal, we first identify popular definitions for the "importance" of pages and propose simple algorithms that can identify important pages at an early stage of a crawl. We then explore how to parallelize the crawling process to maximize the download rate while minimizing the overhead from parallelization. Finally, we experimentally study how Web pages change over time and propose an optimal page refresh policy that maximizes the "freshness" of the retrieved pages.

This work has been part of the WebBase project at Stanford University. The WebBase system currently maintains 130 million pages downloaded from the Web, and these pages are actively used by many researchers within and outside Stanford. The crawler for the WebBase project is a direct result of this dissertation research.
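The idea of reaching important pages early in a crawl can be illustrated with a small sketch. It assumes one common importance measure, the number of backlinks discovered so far, and orders the crawl frontier by that count. The function name `crawl_order`, the in-memory `toy_web` graph, and the choice of metric are illustrative assumptions, not the algorithms proposed in the dissertation.

```python
import heapq
from collections import defaultdict


def crawl_order(link_graph, seed, limit):
    """Return URLs in the order a backlink-count-first crawler would fetch them.

    link_graph is a hypothetical in-memory stand-in for the Web: a dict that
    maps each URL to the URLs it links to. A real crawler would download and
    parse each page instead of looking it up.
    """
    backlinks = defaultdict(int)  # backlinks discovered so far, per URL
    fetched = []
    done = set()
    frontier = [(0, seed)]        # min-heap of (-backlink_count, url)

    while frontier and len(fetched) < limit:
        neg_count, url = heapq.heappop(frontier)
        if url in done or -neg_count != backlinks[url]:
            continue  # already fetched, or a stale entry with an outdated count
        done.add(url)
        fetched.append(url)
        # Count each linking page once, even if it links to a target twice.
        for target in set(link_graph.get(url, ())):
            if target in done:
                continue
            backlinks[target] += 1
            heapq.heappush(frontier, (-backlinks[target], target))
    return fetched


# Toy graph: "d" is linked from both "a" and "b", so it overtakes "c".
toy_web = {"a": ["b", "c", "d"], "b": ["d"], "c": [], "d": []}
print(crawl_order(toy_web, "a", 4))  # ['a', 'b', 'd', 'c']
```

A plain breadth-first crawl of the same graph would fetch "c" before "d". Here "d" moves ahead once a second backlink to it is discovered, which is the sense in which the frontier favors pages that look important based on what the crawler has seen so far.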

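Bibliographic record

Author: Cho, Junghoo
Author affiliation: Stanford University
Degree-granting institution: Stanford University
Subject: Computer Science
Degree: Ph.D.
Year: 2002
Pages: 188
Original format: PDF
Language of text: English
Chinese Library Classification: Automation technology, computer technology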
