System and a method for focused re-crawling of Web sites


Abstract

A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
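The four steps named in the abstract (crawl from seeds, partition pages, discover patterns, restrict the next crawl) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the in-memory `WEB` dictionary stands in for real HTTP fetching, relevance is a simple keyword test, and an exclusion pattern is taken to be a first-path-segment prefix that appears only among irrelevant pages. The patent's actual relevance criterion and pattern-discovery procedure are not described in the abstract.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
WEB = {
    "http://example.com/": ("home product catalogue", [
        "http://example.com/products/a",
        "http://example.com/calendar/2024-01",
        "http://example.com/products/b"]),
    "http://example.com/products/a": ("product a specs",
                                      ["http://example.com/calendar/2024-02"]),
    "http://example.com/products/b": ("product b specs", []),
    "http://example.com/calendar/2024-01": ("empty calendar",
                                            ["http://example.com/calendar/2024-02"]),
    "http://example.com/calendar/2024-02": ("empty calendar",
                                            ["http://example.com/calendar/2024-03"]),
    "http://example.com/calendar/2024-03": ("empty calendar", []),
}


def is_excluded(url, excluded_prefixes):
    """True if the URL path starts with any exclusion prefix."""
    path = urlparse(url).path
    return any(path.startswith(prefix) for prefix in excluded_prefixes)


def crawl(seeds, excluded_prefixes=()):
    """Breadth-first crawl from the seeds, skipping excluded URLs."""
    seen, queue, pages = set(seeds), deque(seeds), {}
    while queue:
        url = queue.popleft()
        if url not in WEB or is_excluded(url, excluded_prefixes):
            continue
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def partition(pages, keyword="product"):
    """Split crawled pages into relevant and irrelevant URL sets (assumed keyword test)."""
    relevant = {url for url, text in pages.items() if keyword in text}
    return relevant, set(pages) - relevant


def path_prefix(url):
    """First path segment, e.g. '/calendar/' (assumed pattern granularity)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "/" + segments[0] + "/" if segments else "/"


def discover_exclusions(relevant, irrelevant):
    """Prefixes seen among irrelevant pages but never among relevant ones."""
    return {path_prefix(u) for u in irrelevant} - {path_prefix(u) for u in relevant}


seeds = ["http://example.com/"]
first_pass = crawl(seeds)                        # unrestricted initial crawl
relevant, irrelevant = partition(first_pass)     # relevant vs. irrelevant pages
patterns = discover_exclusions(relevant, irrelevant)
second_pass = crawl(seeds, patterns)             # re-crawl restricted by patterns
```

In this toy data the calendar pages form an unbounded-looking chain of irrelevant URLs, so the re-crawl skips them and visits only the home page and the two product pages. Inclusion patterns could be sketched the same way by keeping only URLs whose prefix appears among relevant pages.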
