In this paper, a Web mining tool for content-based classification of Web pages is presented. The tool, named WebClass, can be used for resource discovery purposes. The system considers both the textual content of Web pages and the layout structure defined by HTML tags. The representation language adopted for Web pages is the bag-of-words, where words are selected from training documents by means of a novel scoring measure. Three different classification models are empirically compared on a classification task: decision trees, centroids, and k-nearest-neighbor. Experimental results are reported, and conclusions are drawn on the relevance of the HTML layout structure for classification purposes, on the significance of the words selected by the scoring measure, and on the performance of the different classifiers.