
System and method for locating pages on the world wide web and for locating documents from a network of computers

Abstract

A Web crawler system and method for quickly fetching and analyzing Web pages on the World Wide Web includes a hash table stored in random access memory (RAM) and a sequential Web information disk file. For every Web page known to the system, the Web crawler system stores an entry in the sequential disk file as well as a smaller entry in the hash table. The hash table entry includes a fingerprint value, a fetched flag that is set true only if the corresponding Web page has been successfully fetched, and a file location indicator that indicates where the corresponding entry is stored in the sequential disk file. Each sequential disk file entry includes the URL of a corresponding Web page, plus fetch status information concerning that Web page. All accesses to the Web information disk file are made sequentially via an input buffer, such that a large number of entries from the sequential disk file are moved into the input buffer in a single I/O operation. The sequential disk file is then accessed from the input buffer. Similarly, all new entries to be added to the sequential disk file are stored in an append buffer, and the contents of the append buffer are added to the end of the sequential disk file whenever the append buffer is filled. In this way, random access to the Web information disk file is eliminated, and latency caused by disk access limitations is minimized.
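The data layout described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `CrawlIndex`, the tab-separated record format, the buffer sizes, and the use of a truncated SHA-256 digest as the fingerprint are all assumptions made for illustration. The abstract also does not say how fetch status is updated on disk, so the sketch only flips the in-memory flag.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass


def fingerprint(url: str) -> int:
    """64-bit fingerprint of a URL (assumption: truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest()[:8], "big")


@dataclass
class TableEntry:
    fetched: bool  # True only after the page was successfully fetched
    offset: int    # byte offset of the full entry in the sequential disk file


class CrawlIndex:
    """Hypothetical sketch: small RAM table plus an append-only disk file."""

    def __init__(self, path: str, append_limit: int = 64 * 1024,
                 read_size: int = 1024 * 1024) -> None:
        self.path = path
        self.append_limit = append_limit  # flush threshold for the append buffer
        self.read_size = read_size        # bytes pulled in per read (input buffer)
        self.table: dict[int, TableEntry] = {}
        self.append_buffer = bytearray()
        open(path, "ab").close()          # create the file if it does not exist
        self.file_end = os.path.getsize(path)

    def add_url(self, url: str) -> bool:
        """Record a newly discovered URL; return False if it is already known."""
        fp = fingerprint(url)
        if fp in self.table:
            return False
        # The entry will land at the current end of file plus buffered bytes.
        offset = self.file_end + len(self.append_buffer)
        self.table[fp] = TableEntry(fetched=False, offset=offset)
        # Assumed record format: "<status>\t<url>\n" (URLs contain no tab/newline).
        self.append_buffer += f"unfetched\t{url}\n".encode("utf-8")
        if len(self.append_buffer) >= self.append_limit:
            self.flush()
        return True

    def flush(self) -> None:
        """Append the whole buffer to the end of the disk file in one write."""
        if not self.append_buffer:
            return
        with open(self.path, "ab") as f:
            f.write(self.append_buffer)
        self.file_end += len(self.append_buffer)
        self.append_buffer.clear()

    def mark_fetched(self, url: str) -> None:
        """Set the in-memory fetched flag; the disk file is not touched."""
        self.table[fingerprint(url)].fetched = True

    def scan(self):
        """Yield (offset, status, url) for each entry on disk, front to back."""
        offset = 0
        pending = b""
        with open(self.path, "rb") as f:
            while True:
                chunk = f.read(self.read_size)  # one large sequential read
                if not chunk:
                    break
                pending += chunk
                *lines, pending = pending.split(b"\n")
                for line in lines:
                    status, url = line.decode("utf-8").split("\t", 1)
                    yield offset, status, url
                    offset += len(line) + 1     # +1 for the newline


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        index = CrawlIndex(os.path.join(tmp, "web_info.dat"))
        index.add_url("http://example.com/")
        index.add_url("http://example.com/about")
        index.flush()
        for entry in index.scan():
            print(entry)
```

In this sketch the duplicate check touches only RAM, new entries reach the disk only as whole-buffer appends, and `scan` reads the file strictly front to back, so the disk file is never accessed at random. Entries still sitting in the append buffer are not visible to `scan` until `flush` runs.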

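Bibliographic data

  • Publication number: EP1241594A3

    Patent type: (not specified)

  • Publication date: 2005-03-09

    Original format: PDF

  • Applicant/patentee: ALTA VISTA COMPANY

    Application number: EP20020012929

  • Inventor: MONIER LOUIS M.

    Filing date: 1996-12-10

  • Classification: G06F17/30

  • Country: EP

  • Database entry time: 2022-08-21 22:09:53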
