Conference: International Conference on Networked Systems

GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications


Abstract

Crawling web applications is important for indexing, accessibility, and security assessment. Crawling traditional web applications is an old problem for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, remains an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling a RIA more time-consuming for the web crawler. One way to reduce the crawl time is to crawl a RIA in parallel on multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm for crawling RIAs. This paper extends the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn the speed of the nodes and adapt to changes, and thus make better use of the resources. Second, it presents a distributed greedy algorithm for crawling a RIA in parallel, called the GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper describes a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype, and examines empirical performance measurements.
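
The abstract does not spell out how the load balancer works, so the following is only a minimal sketch of one way a server could learn node speeds and dispatch jobs accordingly. Everything in it is an assumption made for illustration: the function names `update_speed` and `assign_jobs`, the exponential moving average, the smoothing factor `alpha`, and the least-estimated-finish-time heuristic. It is not the paper's algorithm, and it does not model the greedy crawling strategy itself.

```python
import heapq


def update_speed(avg_seconds, sample_seconds, alpha=0.5):
    """Blend a new timing sample into a node's running average job time.

    Hypothetical helper: an exponential moving average lets the estimate
    follow changes in node speed. alpha is an assumed smoothing factor.
    """
    return alpha * sample_seconds + (1 - alpha) * avg_seconds


def assign_jobs(jobs, seconds_per_job):
    """Dispatch each job to the node with the earliest estimated finish time.

    seconds_per_job maps a node name to its estimated time per job.
    Returns a dict mapping each node name to the list of jobs it received.
    """
    # Heap entries are (estimated finish time, node name); ties break by name.
    heap = [(0.0, node) for node in sorted(seconds_per_job)]
    heapq.heapify(heap)
    assignment = {node: [] for node in seconds_per_job}
    for job in jobs:
        finish, node = heapq.heappop(heap)
        assignment[node].append(job)
        # The node is busy until its previous finish time plus one more job.
        heapq.heappush(heap, (finish + seconds_per_job[node], node))
    return assignment


if __name__ == "__main__":
    estimates = {"fast": 1.0, "slow": 3.0}
    plan = assign_jobs(list(range(8)), estimates)
    for node, node_jobs in plan.items():
        print(node, node_jobs)
```

With these assumed estimates, the node that is three times faster ends up with three times as many jobs. Calling `update_speed` after each completed job and re-planning the remaining queue is one way the server could adapt when a client slows down or speeds up.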
