Journal of Intelligent Information Systems

E-FFC: an enhanced form-focused crawler for domain-specific deep web databases


Abstract

A key problem in retrieving, integrating and mining rich, high-quality information from massive online Deep Web Databases (WDBs) is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., forms, on the Web. This is a challenging task because the forms of domain-specific WDBs have dynamic and heterogeneous properties and are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed to achieve a satisfactory harvest rate and coverage rate of domain-specific WDBs' forms simultaneously. In this paper, an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC) is proposed as a novel framework to address the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel and effective strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, among others, with the aim of optimizing both the harvest rate and the coverage rate of domain-specific WDBs' forms. Experiments with the E-FFC on a number of real Web pages in a set of representative domains have been conducted, and the results show that the E-FFC outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate and crawling robustness.
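The abstract does not define the two evaluation metrics it names. As a minimal sketch, assuming the convention common in focused-crawling work (harvest rate as relevant forms found per page crawled, coverage rate as the fraction of a known reference set of relevant forms that the crawler found), they could be computed as follows. The function names and the reference-set formulation are illustrative assumptions, not taken from the paper.

```python
def harvest_rate(relevant_forms_found, pages_crawled):
    """Assumed definition: relevant forms found per page crawled."""
    if pages_crawled == 0:
        return 0.0
    return relevant_forms_found / pages_crawled


def coverage_rate(found_forms, known_relevant_forms):
    """Assumed definition: share of a known reference set of relevant
    forms that the crawler actually discovered."""
    if not known_relevant_forms:
        return 0.0
    hits = set(found_forms) & set(known_relevant_forms)
    return len(hits) / len(known_relevant_forms)


# Hypothetical numbers, for illustration only.
hr = harvest_rate(relevant_forms_found=25, pages_crawled=1000)
cr = coverage_rate({"a", "b", "x"}, {"a", "b", "c", "d"})
```

Under these assumed definitions the two metrics pull in opposite directions: crawling more pages tends to raise coverage while lowering the harvest rate, which is why the abstract stresses achieving both simultaneously.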