Many valuable databases on the Web have non-crawlable contents that are “hidden” behind search forms. Their information is available only by manually filling out HTML forms to query the underlying databases. For automated agents to access the data behind these forms, the critical task is obtaining machine-understandable query interfaces for the hidden databases. This paper presents an automatic approach to indexing hidden Web portals across various domains. It discovers and scrapes query forms from Web pages based on their tag-tree representation, and then interprets them into uniform mediated interfaces with the aid of a domain ontology definition. To achieve high transformation accuracy, the domain ontology is also used to filter out interfaces that are not related to the specific domain. The resulting query interfaces, represented with common concepts, can be indexed and retrieved automatically by programs. The experiments indicate that the algorithms used are efficient and that the system is practically useful for information systems or personalized Web search systems that need to retrieve contents from the hidden Web.
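
The pipeline described above (extract forms from the tag tree, map their fields to ontology concepts, and discard off-domain forms) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy `BOOK_ONTOLOGY`, the names `FormExtractor`, `to_concepts`, and `domain_interfaces`, the exact-synonym matching, and the `min_ratio` threshold are all assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical toy ontology for a "book" domain: concept -> accepted synonyms.
BOOK_ONTOLOGY = {
    "title": {"title", "book title", "name"},
    "author": {"author", "writer"},
    "isbn": {"isbn"},
}

# Input types that carry no query semantics and are skipped.
NON_QUERY_TYPES = {"submit", "hidden", "button", "reset", "image"}


class FormExtractor(HTMLParser):
    """Walks the tag stream and collects the field names of every <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of field names per form
        self._current = None  # fields of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif tag in ("input", "select", "textarea") and self._current is not None:
            if attrs.get("type", "text").lower() in NON_QUERY_TYPES:
                return
            name = attrs.get("name")
            if name:
                self._current.append(name.lower())

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def to_concepts(fields, ontology):
    """Map raw field names to ontology concepts by exact synonym lookup."""
    mapped = {}
    for field in fields:
        key = field.replace("_", " ").strip()
        for concept, synonyms in ontology.items():
            if key in synonyms:
                mapped[field] = concept
                break
    return mapped


def domain_interfaces(html, ontology, min_ratio=0.5):
    """Return concept mappings for forms that look related to the domain.

    A form is kept when at least `min_ratio` of its query fields map to
    an ontology concept (an assumed filtering rule, not the paper's).
    """
    parser = FormExtractor()
    parser.feed(html)
    parser.close()
    kept = []
    for fields in parser.forms:
        mapped = to_concepts(fields, ontology)
        if fields and len(mapped) / len(fields) >= min_ratio:
            kept.append(mapped)
    return kept
```

Given a page containing both a book search form and a login form, `domain_interfaces(page, BOOK_ONTOLOGY)` would keep only the search form, expressed in ontology concepts rather than site-specific field names. A realistic system would also need to use visible labels and fuzzier matching, which this sketch omits.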