The World Wide Web contains vast amounts of data, but only a small portion of it is accessible to machines in an operational form. The rest lies behind a presentation layer that renders web pages in a human-friendly form but hampers machine processing of the data. Converting web data into operational form is the task of data extraction. Current approaches to data extraction from the web either require human effort to guide supervised learning algorithms or are customized to extract a narrow range of data types in specific domains. We focus on the broader problem of discovering the underlying structure of any database-generated web site. Our approach automatically discovers the relational data hidden behind these web sites by combining experts that identify the relationship between surface structure and the underlying structure.

Our approach uses a set of software experts that analyze a web site's pages. Each expert is specialized to recognize a particular type of structure. The experts discover similarities between data items within the context of the particular types of structure they analyze, and they output their discoveries as hypotheses in a common hypothesis language. We find the most likely clustering of the data using a probabilistic framework in which the hypotheses provide the evidence. From the clusters, the relational form of the data is derived.

We develop two frameworks following the principles of our approach. The first framework introduces a common hypothesis language in which heterogeneous experts express their discoveries. The second framework extends the common language to allow experts to assign confidence scores to their hypotheses.

We experiment in the web domain by comparing the output of our approach to data extracted by a supervised wrapper-induction system and validated manually. Our results show that our approach performs well in the data extraction task on a variety of web sites.

Our approach is applicable to other structure discovery problems as well. We demonstrate this by successfully applying it in the record deduplication domain.