International Journal of Engineering & Technology

An XML based Web Crawler with Page Revisit Policy and Updation in Local Repository of Search Engine


Abstract

In a large collection of web pages, it is difficult for search engines to keep their online repositories up to date. Major search engines run hundreds of web crawlers that crawl the WWW day and night and send the downloaded web pages over a network to be stored in the search engine's database. This results in overutilization of network resources such as bandwidth and CPU cycles. This paper proposes an architecture that tries to reduce the utilization of shared network resources with the help of an advanced XML-based approach. The architecture is based on focused crawling and is trained to download only high-quality data from the internet, leaving behind web pages that are not relevant to the desired domain. A detailed layout of the proposed system is described, which is capable of reducing the load on the network and reducing the problems that arise from the residency of mobile agents at the remote server.
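The abstract does not give the XML format or the revisit policy itself, so the following is only an illustrative sketch of the general idea: exchange a compact XML summary of pages rather than the pages themselves, and download only what has changed. The element names (`pages`, `page`), the attributes (`url`, `digest`), the use of SHA-256, and the function names are all assumptions made for this example, not details taken from the paper.

```python
import hashlib
import xml.etree.ElementTree as ET


def content_digest(text: str) -> str:
    """Return a SHA-256 hex digest of page text (assumed change signal)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_summary(pages: dict[str, str]) -> str:
    """Serialize {url: page_text} into a small XML summary of URLs and digests.

    The <pages>/<page> layout is hypothetical, chosen only for illustration.
    """
    root = ET.Element("pages")
    for url, text in sorted(pages.items()):
        ET.SubElement(root, "page", url=url, digest=content_digest(text))
    return ET.tostring(root, encoding="unicode")


def pages_to_refresh(summary_xml: str, local_digests: dict[str, str]) -> list[str]:
    """List URLs whose digest is missing from, or differs from, the local repository."""
    root = ET.fromstring(summary_xml)
    return [
        page.get("url")
        for page in root.iter("page")
        if local_digests.get(page.get("url")) != page.get("digest")
    ]


if __name__ == "__main__":
    # Pages as they currently exist on the remote server (made-up data).
    remote = {
        "http://example.com/a": "unchanged text",
        "http://example.com/b": "updated text",
        "http://example.com/c": "brand new page",
    }
    # Digests stored in the search engine's local repository (made-up data).
    local = {
        "http://example.com/a": content_digest("unchanged text"),
        "http://example.com/b": content_digest("old text"),
    }
    print(pages_to_refresh(build_summary(remote), local))
    # prints ['http://example.com/b', 'http://example.com/c']
```

In this sketch only the small XML summary crosses the network on each visit, and full page downloads are limited to new or modified URLs, which is one plausible way a crawler could reduce bandwidth use while keeping its local repository current.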
