HiCrawl: A Hidden Web Crawler for Medical Domain


Abstract

The Hidden Web refers to the huge portion of the WWW that holds numerous freely accessible Web databases hidden behind search form interfaces. These databases can only be reached through dynamic Web pages generated in response to user queries issued at the search form interface. The core challenge in implementing any Hidden Web crawler is therefore to get past these search form interfaces routinely by automatically generating and issuing queries that help discover the dynamic Web pages. The paper presents a novel approach that guides the crawler in choosing the right query term to submit to any search form interface designed to accept keywords or terms as input. The system is based on classification hierarchies, which may have been constructed either manually or automatically. For illustration, we consider search form interfaces in the 'Medical' domain, one of the domains most popular among researchers, together with a manually generated top-down classification hierarchy for that domain.
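
The idea of drawing query terms from a top-down classification hierarchy can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the hierarchy contents, the name `MEDICAL_HIERARCHY`, and the function `query_terms_top_down` are all hypothetical, and the level-by-level traversal order is an assumption about what "top-down" might mean in practice.

```python
from collections import deque

# Hypothetical, hand-built hierarchy for illustration only; the terms are
# NOT taken from the paper. Each key is a term, each value its sub-terms.
MEDICAL_HIERARCHY = {
    "medicine": {
        "cardiology": {"arrhythmia": {}, "hypertension": {}},
        "oncology": {"leukemia": {}, "melanoma": {}},
    }
}


def query_terms_top_down(hierarchy):
    """Yield candidate query terms level by level, general before specific."""
    queue = deque(hierarchy.items())
    while queue:
        term, children = queue.popleft()
        yield term
        # Enqueue the sub-terms so they are visited after the current level.
        queue.extend(children.items())


terms = list(query_terms_top_down(MEDICAL_HIERARCHY))
```

A crawler built along these lines would submit each yielded term to a keyword search form in turn. How the terms are ranked, pruned, or submitted is left open here, since the abstract does not specify it.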
