Conference: International Conference on Web Engineering

Indexing Rich Internet Applications Using Components-Based Crawling

Abstract

Automatic crawling of Rich Internet Applications (RIAs) is challenging because client-side code modifies the client dynamically, fetching server-side data asynchronously. Most existing solutions model RIAs as state machines, with DOMs as states and JavaScript event executions as transitions. This approach fails on complex, "real-life" RIAs because the resulting model is far too large to be practical. In this paper, we propose a new method to crawl AJAX-based RIAs efficiently by detecting "components", which are areas of the DOM that are independent of each other, and by crawling each component separately. This dramatically reduces the state space required for the model without loss of content coverage. Our method requires neither prior knowledge of the RIA nor predefined component definitions. Instead, we infer the components by observing the behavior of the RIA during crawling. Our experimental results show that our method can quickly and completely index industrial RIAs that are simply out of reach for traditional methods.
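To give a feel for why crawling independent components separately can shrink the model, the following toy sketch compares two ways of counting states. It is not the paper's algorithm: the component names, their local states, and both helper functions are hypothetical, and the components are declared by hand here, whereas the abstract says they are inferred during crawling. The sketch only illustrates that a whole-DOM model grows with the product of the per-component state counts, while a per-component model grows with their sum.

```python
from itertools import product

# Hypothetical toy RIA: each key is an independent DOM region ("component"),
# each value lists the local states that region can be in.
components = {
    "menu": ["collapsed", "expanded"],
    "tabs": ["tab1", "tab2", "tab3"],
    "carousel": ["slide1", "slide2", "slide3", "slide4"],
}


def whole_dom_states(comps):
    """Enumerate global DOM states: one per combination of local states."""
    names = sorted(comps)
    combos = product(*(comps[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def per_component_states(comps):
    """Enumerate (component, local state) pairs, one component at a time."""
    return [(name, state) for name in sorted(comps) for state in comps[name]]


global_states = whole_dom_states(components)
local_states = per_component_states(components)

# Content reachable through either model: the set of (component, state) pairs.
content_global = {(n, s) for g in global_states for n, s in g.items()}
content_local = set(local_states)

print(len(global_states))               # 2 * 3 * 4 = 24
print(len(local_states))                # 2 + 3 + 4 = 9
print(content_global == content_local)  # True
```

In this toy setting both models expose the same nine pieces of content, but the whole-DOM model needs 24 states to do so. The gap widens quickly as components are added, which is consistent with the abstract's claim of a large state-space reduction without loss of coverage.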
