Journal: Data & Knowledge Engineering

Automatic generation of agents for collecting hidden Web pages for data extraction

Abstract

As the Web grows, more and more data is becoming available through dynamic forms of publication, such as legacy databases accessed through an HTML form (the so-called hidden Web). In such situations, integrating this data depends increasingly on the fast generation of agents that can automatically fetch pages for further processing. As a result, there is an increasing need for tools that help users generate such agents. In this paper, we describe a method for automatically generating agents to collect hidden Web pages. The method uses a pre-existing data repository to identify the contents of these pages and takes advantage of patterns that recur among Web sites to identify the navigation paths to follow. To demonstrate the accuracy of our method, we discuss the results of a number of experiments carried out with sites from different domains.
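
The idea of using a pre-existing data repository to recognize the contents of a fetched page can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the sample repository, the sample pages, and the `min_hits` threshold are all assumptions made for illustration. The sketch counts how many known repository values appear in a page's text and treats a page with enough matches as a likely data page.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so matching ignores layout differences."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def strip_tags(html):
    """Crude tag removal for illustration; a real agent would use an HTML parser."""
    return re.sub(r"<[^>]+>", " ", html)

def repository_hits(html, repository):
    """Return the repository values that occur in the page's text."""
    text = normalize(strip_tags(html))
    return [value for value in repository if normalize(value) in text]

def looks_like_data_page(html, repository, min_hits=2):
    """Assumed heuristic: enough repository matches suggests a data page."""
    return len(repository_hits(html, repository)) >= min_hits

# Hypothetical repository of already-known values from the target domain.
repository = ["Moby Dick", "Herman Melville", "Pride and Prejudice", "Jane Austen"]

# Hypothetical pages: a search form and a result page reached through it.
form_page = "<html><form action='/search'><input name='q'></form></html>"
result_page = (
    "<html><table><tr><td>Moby   Dick</td>"
    "<td>Herman Melville</td></tr></table></html>"
)

print(looks_like_data_page(form_page, repository))
print(looks_like_data_page(result_page, repository))
```

An agent generator along these lines could follow candidate navigation paths from a form and keep those whose destination pages pass such a check. The fixed threshold is a simplification chosen for the sketch.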