Dynamic-content web crawling through traffic monitoring

Abstract

A dynamic-content web crawler is disclosed. These New Crawlers (NCs) are located at points between the server and the user, and monitor content from those points, for example by proxying the web traffic or sniffing the traffic as it goes by. Web page content is recursively parsed into sub-components. Sub-components are fingerprinted with a cyclic redundancy check code or other lossy compression so that recurrence of a sub-component can be detected in subsequent pages. Those sub-components which persist in the web traffic, as measured by how frequently the NCs (6) observe them, are defined as having substantive content of interest to data-mining applications. Where a substantive content sub-component is added to or removed from a web page, the change is significant and is sent to a duplication filter (11), so that if multiple NCs (6) detect a change in a web page, only one announcement of the changed URL is broadcast to data-mining applications (8). The NC (6) identifies substantive content sub-components which are repeatedly part of a page pointed to by a URL. Provision is also made for limiting monitoring to pages having a flag authorizing discovery of the page by a monitor.
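
The pipeline described in the abstract (fingerprint sub-components, treat recurring ones as substantive, report changes through a duplication filter) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names NewCrawler, DuplicationFilter and min_sightings, the URL, the splitting of plain text on blank lines and newlines in place of real HTML parsing, and the rule that a sub-component counts as substantive after two sightings are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from typing import Optional


def fingerprint(component: str) -> int:
    """CRC-32 of a whitespace-normalised sub-component (a lossy fingerprint)."""
    return zlib.crc32(" ".join(component.split()).encode("utf-8"))


def fingerprints(content: str, seps=("\n\n", "\n")) -> set[int]:
    """Recursively split content on each separator, fingerprinting every level."""
    if not content.strip():
        return set()
    found = {fingerprint(content)}
    if seps:
        for piece in content.split(seps[0]):
            found |= fingerprints(piece, seps[1:])
    return found


class NewCrawler:
    """Hypothetical NC: counts how often each fingerprint recurs per URL."""

    def __init__(self, min_sightings: int = 2):
        self.min_sightings = min_sightings  # assumed persistence threshold
        self.sightings = defaultdict(lambda: defaultdict(int))
        self.substantive = {}  # url -> persistent fingerprints at last observation

    def observe(self, url: str, content: str) -> Optional[frozenset]:
        """Return the new substantive set if it changed, otherwise None."""
        seen = fingerprints(content)
        counts = self.sightings[url]
        for fp in seen:
            counts[fp] += 1
        current = frozenset(fp for fp in seen if counts[fp] >= self.min_sightings)
        if current == self.substantive.get(url, frozenset()):
            return None
        self.substantive[url] = current
        return current


class DuplicationFilter:
    """Hypothetical filter: announces a URL once per distinct reported state."""

    def __init__(self):
        self.last_announced = {}

    def report(self, url: str, state: frozenset) -> bool:
        if self.last_announced.get(url) == state:
            return False  # another NC already reported this change
        self.last_announced[url] = state
        return True


# Two NCs watch the same traffic; the headline persists while the ad rotates.
url = "http://example.com/news"  # placeholder URL
pages = ["Headline A\n\nAd 1", "Headline A\n\nAd 2", "Headline A\n\nAd 3",
         "Headline B\n\nAd 4", "Headline B\n\nAd 5"]
ncs, dedup, announcements = [NewCrawler(), NewCrawler()], DuplicationFilter(), []
for page in pages:
    for nc in ncs:
        state = nc.observe(url, page)
        if state is not None and dedup.report(url, state):
            announcements.append(page)
print(len(announcements), "announcements")
```

In this sketch the rotating ad lines never reach the sighting threshold, so they trigger no announcement, while the appearance and disappearance of a recurring headline does. Sighting counts never decay here, so an ad that happens to recur would eventually be treated as substantive; a real system would presumably need a more careful persistence measure.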

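Bibliographic details

  • Publication number: US2004128285A1
  • Publication date: 2004-07-01
  • Original format: PDF
  • Applicant/assignee: GREEN JACOB; SCHULTZ JOHN
  • Application number: US20040433605
  • Inventors: JOHN SCHULTZ; JACOB GREEN
  • Filing date: 2004-02-20
  • Classification: G06F17/30
  • Country: US
  • Date added to database: 2022-08-21 23:19:37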
