
A dynamic URL assignment method for parallel web crawler




A web crawler is a relatively simple automated program or script that methodically scans, or "crawls", Internet pages to retrieve information. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer. Web crawlers have many uses. Their primary purpose is to collect data so that, when a user enters a search term on a search site, the site can quickly return relevant web sites. In this work we propose a model of a low-cost web crawler for distributed environments, based on an efficient URL assignment algorithm. The function of every module of the crawler is analyzed, and the main rules that crawlers must follow to maintain load balancing and system robustness while searching the web simultaneously are discussed. The proposed dynamic URL assignment method, based on grid computing technology and dynamic clustering, efficiently increases web crawler performance.
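The abstract does not give the details of the assignment algorithm, so the following is only a minimal sketch of what dynamic URL assignment with load balancing might look like, not the authors' method. It assumes a hypothetical coordinator class, `DynamicAssigner`, that keeps all URLs of one host on the same crawler, gives each newly seen host to the least-loaded crawler, and reassigns the pending URLs of a crawler that leaves the system. The grid computing and dynamic clustering components mentioned in the abstract are not modeled here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class DynamicAssigner:
    """Hypothetical coordinator mapping URL hosts to crawler nodes.

    Illustrative only: the names and the least-loaded policy are
    assumptions, not taken from the paper.
    """

    def __init__(self, crawler_ids):
        self.load = {cid: 0 for cid in crawler_ids}  # URLs assigned per crawler
        self.owner = {}                              # host -> crawler id
        self.queues = defaultdict(list)              # crawler id -> assigned URLs

    def assign(self, url):
        """Return the crawler id responsible for this URL."""
        if not self.load:
            raise RuntimeError("no crawlers available")
        host = urlsplit(url).hostname or ""
        cid = self.owner.get(host)
        if cid is None:
            # New host: pick the least-loaded crawler (ties broken by id).
            cid = min(self.load, key=lambda c: (self.load[c], c))
            self.owner[host] = cid
        self.queues[cid].append(url)
        self.load[cid] += 1
        return cid

    def remove_crawler(self, cid):
        """Drop a crawler and redistribute its hosts and queued URLs."""
        pending = self.queues.pop(cid, [])
        del self.load[cid]
        for host in [h for h, c in self.owner.items() if c == cid]:
            del self.owner[host]
        return [(url, self.assign(url)) for url in pending]
```

Keeping a host on a single crawler is one common way to avoid fetching the same page twice from different nodes; reassigning on removal is a simple stand-in for the robustness requirement the abstract mentions.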


