Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages

Abstract

Automated crawling of page links associated with a previously crawled site domain involves computing the dynamicity of the site based on totals of continuous dead links, live links and/or prerequisite pages encountered while crawling page links corresponding to the site. The degree to which links are crawled is optimized based on the dynamicity of the site. Some pages can be retrieved only after another particular page (i.e., a prerequisite page) has been retrieved from the host, e.g., so that the prerequisite page can set a cookie. Prerequisite pages are determined based on stored information about the pages that were retrieved, during a previous crawl, prior to retrieving a given page. Prerequisite pages are identified to a search system so that when a user clicks on the URL for the page, the request is redirected to the prerequisite page, which sets the cookie appropriately.
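
The abstract does not give a formula for dynamicity or a procedure for choosing a prerequisite page, so the following Python sketch is only one possible reading. It assumes that dynamicity is the fraction of crawled links found dead, that a more dynamic site is given a larger recrawl budget, and that a prerequisite page is a page fetched before the target page in every recorded session of the previous crawl. The names `CrawlTotals`, `dynamicity`, `links_to_recrawl` and `infer_prerequisite`, and the `min_links` parameter, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrawlTotals:
    """Per-site counters gathered while crawling (hypothetical structure)."""
    dead_links: int
    live_links: int


def dynamicity(totals: CrawlTotals) -> float:
    """Assumed score in [0, 1]: the share of crawled links that were dead."""
    crawled = totals.dead_links + totals.live_links
    if crawled == 0:
        return 0.0  # nothing crawled yet, treat the site as static
    return totals.dead_links / crawled


def links_to_recrawl(score: float, max_links: int, min_links: int = 10) -> int:
    """Assumed policy: scale the recrawl budget with the dynamicity score."""
    return max(min_links, round(score * max_links))


def infer_prerequisite(history: List[List[str]]) -> Optional[str]:
    """Pick a candidate prerequisite page for one target page.

    `history` holds, for each past retrieval of the target page, the URLs
    fetched before it in that session, in fetch order. The candidate is a
    URL present in every session; ties are broken in favour of the URL
    fetched closest to the target page in the most recent session.
    """
    if not history:
        return None
    common = set(history[0])
    for prior in history[1:]:
        common &= set(prior)
    for url in reversed(history[-1]):
        if url in common:
            return url
    return None


if __name__ == "__main__":
    totals = CrawlTotals(dead_links=30, live_links=70)
    score = dynamicity(totals)
    print(score, links_to_recrawl(score, max_links=1000))
    print(infer_prerequisite([["/login", "/home"], ["/home", "/news"]]))
```

A search system following the abstract would store the inferred prerequisite URL alongside the indexed page and redirect clicks through it, so that the cookie is set before the target page is requested.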