The Web so far has been incredibly successful at delivering information to human users. So successful, in fact, that there is now an urgent need to go beyond a browsing human and make information accessible to applications, in order to offer automation, interoperation and Web-awareness among services. To do so, information from Web sources needs to be accessible in a structured way. XML and its various extensions (data models, query languages) are a step in this direction. Unfortunately, the Web is not yet a well-organized repository of nicely structured XML documents but rather a conglomerate of volatile HTML pages, from which structure has to be extracted. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a Java toolkit for the generation of wrappers for Web sources. Our main contributions are: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to XML documents, with the automatic generation of the corresponding DTDs; (3) visual support to make the engineering of wrappers faster and easier. As an illustration, we show how we can, via W4F intermediation, transparently query HTML sources from an XML query language.