Journal: The VLDB Journal

OXPath: A language for scalable data extraction, automation, and crawling on the deep web


Abstract

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
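
The page-at-a-time idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not OXPath's implementation or syntax; it only shows, under simplifying assumptions, why memory use need not grow with the number of visited pages: records are emitted as soon as they are found and only the current page is kept. The `SITE` dictionary, `load_page`, and `extract_stream` are hypothetical names invented for this illustration, and the "site" is simulated in memory rather than rendered in a browser.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a paginated site:
# page id -> (records found on that page, id of the next page or None).
Page = Tuple[List[str], Optional[str]]

SITE: Dict[str, Page] = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c"], "p3"),
    "p3": (["d", "e"], None),
}


def load_page(page_id: str) -> Page:
    """Simulate rendering one page; a real tool would drive a browser here."""
    return SITE[page_id]


def extract_stream(start: str) -> Iterator[str]:
    """Yield records while holding only the current page in memory."""
    current: Optional[str] = start
    while current is not None:
        records, next_id = load_page(current)
        for record in records:
            yield record  # emitted immediately instead of being buffered
        current = next_id  # the previous page is no longer referenced


# Collected into a list here only so the output can be inspected;
# a streaming consumer would iterate over the generator directly.
results = list(extract_stream("p1"))
```

Because `extract_stream` is a generator, a consumer that writes each record out as it arrives keeps the working set at one page regardless of how many pages are followed.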