Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue

Abstract

A method and system for scheduling downloads in a web crawler. A web crawler may use multiple threads to download documents from the World Wide Web. Both threads and queues are identified by numerical IDs. Each thread in the web crawler is assigned to dequeue from a queue until the assigned queue is empty. Each thread enqueues URLs as new URLs are discovered in the course of downloading web pages. In one embodiment, when a thread discovers a new URL, a numerical function is performed on the URL's host component to determine the queue in which to enqueue the new URL. In another embodiment, each queue in a web crawler may be dynamically assigned to a host computer so that URLs enqueued into the same queue all have the same host component. When a queue becomes empty, a new host may be dynamically assigned to it. In both embodiments, when all the threads are dequeuing in parallel from their respectively assigned queues, no more than one request is made to any one host computer at the same time.
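The two embodiments in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation. The names `queue_id_for`, `HashedFrontier`, and `DynamicFrontier`, the use of CRC32 as the "numerical function", and the queue count are all assumptions made for illustration. The sketch also assumes thread `i` reads only queue `i`, and it omits the locking and the actual downloading that a real multi-threaded crawler would need.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

NUM_QUEUES = 4  # assumed value; the abstract does not fix a number


def host_of_url(url: str) -> str:
    """Return the lower-cased host component of a URL ('' if absent)."""
    return (urlsplit(url).hostname or "").lower()


def queue_id_for(url: str, num_queues: int = NUM_QUEUES) -> int:
    """First embodiment: a numerical function of the host picks the queue.

    CRC32 is an illustrative choice; the abstract only says
    "a numerical function".
    """
    return zlib.crc32(host_of_url(url).encode("utf-8")) % num_queues


class HashedFrontier:
    """One queue per thread; URLs for the same host always share a queue."""

    def __init__(self, num_queues: int = NUM_QUEUES):
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, url: str) -> int:
        qid = queue_id_for(url, len(self.queues))
        self.queues[qid].append(url)
        return qid

    def dequeue(self, thread_id: int):
        q = self.queues[thread_id]
        return q.popleft() if q else None


class DynamicFrontier:
    """Second embodiment: each queue is bound to one host at a time."""

    def __init__(self, num_queues: int = NUM_QUEUES):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_of = [None] * num_queues  # host currently bound to each queue
        self.queue_of = {}                  # host -> queue ID
        self.waiting = {}                   # host -> URLs awaiting a free queue

    def enqueue(self, url: str) -> None:
        host = host_of_url(url)
        qid = self.queue_of.get(host)
        if qid is not None:
            self.queues[qid].append(url)
        else:
            self.waiting.setdefault(host, deque()).append(url)

    def dequeue(self, thread_id: int):
        q = self.queues[thread_id]
        if not q:
            # The queue is empty: release its host, then bind a waiting host.
            old = self.host_of[thread_id]
            if old is not None:
                del self.queue_of[old]
                self.host_of[thread_id] = None
            if self.waiting:
                host = next(iter(self.waiting))  # oldest waiting host
                q.extend(self.waiting.pop(host))
                self.host_of[thread_id] = host
                self.queue_of[host] = thread_id
        return q.popleft() if q else None
```

In both classes a host's URLs sit in exactly one queue at any moment, and each queue is drained by a single thread, which is what keeps concurrent requests to the same host from occurring. The hashed variant is simpler but can leave some threads idle when hosts hash unevenly; the dynamic variant trades extra bookkeeping for the ability to rebind an empty queue to a new host.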
Bibliographic details

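  • Publication number: US6377984B1

    Patent type:

  • Publication date: 2002-04-23

    Original format: PDF

  • Applicant/Assignee: ALTA VISTA COMPANY

    Application number: US19990433004

  • Inventors: CLARK ALLAN HEYDON; MARC ALEXANDER NAJORK

    Filing date: 1999-11-02

  • Classification: G06F151/60; G06F151/73

  • Country: US

  • Date added to database: 2022-08-22 00:48:08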
