The Internet comprises a massive amount of information in the form of billions of web pages. This information can be categorized into the surface web and the deep web. Existing search engines can effectively exploit surface-web information, but the deep web remains largely untapped. Machine learning techniques have commonly been employed to access deep-web content. Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between words with multiple meanings. Clustering is one of the key approaches to organizing deep-web databases. In this paper, we cluster deep-web databases based on the relevance found among deep-web forms by employing a generative probabilistic model, Latent Dirichlet Allocation (LDA), to model content representative of deep-web databases. This is done after preprocessing the set of web pages to extract page contents and form contents. We then derive the "topics per document" and "words per topic" distributions using Gibbs sampling. Experimental results show that the proposed method clearly outperforms existing clustering methods.
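The clustering idea described above can be sketched in a few lines: fit an LDA model to the text extracted from each deep-web form, then assign each form to its dominant topic as a cluster label. This is a minimal illustration, not the paper's implementation: the form texts below are hypothetical stand-ins for extracted page/form contents, and scikit-learn's `LatentDirichletAllocation` uses online variational Bayes rather than the Gibbs sampling the paper employs.

```python
# Minimal sketch of LDA-based clustering of deep-web form texts.
# Note: sklearn's LDA uses variational inference, not Gibbs sampling,
# and the texts below are hypothetical examples, not real extracted forms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for preprocessed page/form contents (two apparent domains).
form_texts = [
    "flight departure arrival airline ticket fare",
    "airline flight ticket booking departure city",
    "book title author isbn publisher edition",
    "author title publisher book library catalog",
]

counts = CountVectorizer().fit_transform(form_texts)          # term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # "topics per document" distribution
clusters = doc_topics.argmax(axis=1)     # dominant topic used as cluster label

print(clusters)
```

Each row of `doc_topics` is a probability distribution over the two topics, and forms sharing a dominant topic land in the same cluster; the "words per topic" distribution is available via `lda.components_`.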