
A New Architecture of an Intelligent Agent-Based Crawler for Domain-Specific Deep Web Databases




A key problem in retrieving, integrating, and mining rich, high-quality information from the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., searchable forms, on the Web. This is a challenging task because the forms of domain-specific WDBs are dynamic and heterogeneous and are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more intelligent and effective solutions remain to be explored. This paper proposes a new architecture of an intelligent agent-based crawler (iCrawler) for domain-specific Deep Web databases to address the limitations of existing methods. The iCrawler is based on intelligent learning agents and a domain ontology, and it combines a series of novel strategies, including a two-step page classifier and a link scoring strategy, to improve on the performance of existing methods. Experiments with the iCrawler on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web Form-Focused Crawlers (FFCs) in terms of harvest rate, coverage rate, and time performance.




