Ontology-Based Automatically Hidden Web Portal Index

SONG Hui; PAN Leyun; MA Fanyuan

Abstract

Many valuable databases on the Web have non-crawlable contents that are “hidden” behind search forms. Their information is available only by manually filling out HTML forms to query the underlying databases. For automated agents to access the data behind forms, the critical task is to obtain query interfaces for the hidden databases that machines can understand. This paper presents an automatic approach to indexing hidden Web portals across various domains. It discovers and extracts query forms from Web pages based on their tag-tree representation, and then interprets them into uniform mediated interfaces with the aid of a domain ontology definition. To achieve high transformation accuracy, the domain ontology is also used to filter out interfaces that are not related to the specific domain. The resulting query interfaces, represented with common concepts, can be indexed and retrieved automatically by programs. The experiments indicate that the algorithms are efficient and that the system is of practical use to information systems or personalized Web search systems that retrieve content from the hidden Web.
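
The pipeline summarized in the abstract (extract forms, map their fields to ontology concepts, filter out off-domain forms) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy ontology `BOOK_ONTOLOGY`, the exact-match synonym lookup, the `min_concepts` threshold, and all function names are assumptions made for illustration, and the standard-library `html.parser` stands in for the paper's tag-tree analysis.

```python
from html.parser import HTMLParser

# Hypothetical toy ontology for a "book" domain:
# concept -> field-name synonyms (all lowercase).
BOOK_ONTOLOGY = {
    "title": {"title", "booktitle", "name"},
    "author": {"author", "writer", "creator"},
    "isbn": {"isbn"},
}


class FormExtractor(HTMLParser):
    """Collects the named, user-fillable fields of every <form> on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form: {"action": str, "fields": [str]}
        self._current = None  # form currently being parsed, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "fields": []}
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            kind = (attrs.get("type") or "text").lower()
            # Skip controls that carry no query semantics.
            if name and kind not in ("submit", "hidden", "button", "reset"):
                self._current["fields"].append(name.lower())

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def to_mediated_interface(form, ontology):
    """Map raw field names to ontology concepts; unmatched fields are dropped."""
    mapping = {}
    for field in form["fields"]:
        for concept, synonyms in ontology.items():
            if field in synonyms:
                mapping[concept] = field
                break
    return mapping


def index_page(html, ontology, min_concepts=2):
    """Return mediated interfaces for forms that match enough domain concepts."""
    parser = FormExtractor()
    parser.feed(html)
    parser.close()
    index = []
    for form in parser.forms:
        mapping = to_mediated_interface(form, ontology)
        # Domain filter: forms matching too few concepts are treated as off-domain.
        if len(mapping) >= min_concepts:
            index.append({"action": form["action"], "concepts": mapping})
    return index


SAMPLE_PAGE = """
<form action="/search">
  <input type="text" name="BookTitle">
  <input name="writer">
  <input type="submit" name="go">
</form>
<form action="/login">
  <input name="user">
  <input type="password" name="pass">
</form>
"""

if __name__ == "__main__":
    print(index_page(SAMPLE_PAGE, BOOK_ONTOLOGY))
```

On the sample page, the search form is kept because two of its fields match ontology concepts, while the login form matches none and is filtered out. A real system would need fuzzier matching (labels, surrounding text, synonyms from the ontology) than this exact-name lookup.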

