A machine learning approach to web mining


Abstract

In this paper, a Web mining tool for content-based classification of Web pages is presented. The tool, named WebClass, can be used for resource discovery purposes. The system considers both the textual content of Web pages and the layout structure defined by HTML tags. The representation language adopted for Web pages is the bag-of-words, where words are selected from training documents by means of a novel scoring measure. Three different classification models are empirically compared on a classification task: decision trees, centroids, and k-nearest-neighbor. Experimental results are reported, and conclusions are drawn on the relevance of the HTML layout structure for classification purposes, on the significance of the words selected by the scoring measure, and on the performance of the different classifiers.
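To make the bag-of-words representation and two of the classifiers named in the abstract more concrete, here is a minimal Python sketch of centroid and k-nearest-neighbor classification over word counts. It is an illustration only, not the WebClass implementation: the function names, the toy `TRAIN` pages, and the use of cosine similarity are all assumptions. The abstract does not describe the paper's scoring measure, so no word selection is performed here, and the HTML layout structure is ignored.

```python
import math
import re
from collections import Counter

# Hypothetical toy training set: (page text, class label).
TRAIN = [
    ("football match score goal team", "sports"),
    ("team wins the league match", "sports"),
    ("new processor chip released", "tech"),
    ("software update for the chip driver", "tech"),
]

def bag_of_words(text):
    """Lowercase the text and count each alphabetic token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-weight mappings."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_centroids(labelled_pages):
    """Average the bag-of-words vectors of each class into one centroid."""
    sums, counts = {}, Counter()
    for text, label in labelled_pages:
        sums.setdefault(label, Counter()).update(bag_of_words(text))
        counts[label] += 1
    return {
        label: {w: c / counts[label] for w, c in bag.items()}
        for label, bag in sums.items()
    }

def classify_centroid(centroids, text):
    """Assign the class whose centroid is most similar to the page."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

def classify_knn(labelled_pages, text, k=3):
    """Majority vote among the k most similar training pages."""
    vec = bag_of_words(text)
    ranked = sorted(
        labelled_pages,
        key=lambda page: cosine(vec, bag_of_words(page[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

centroids = train_centroids(TRAIN)
print(classify_centroid(centroids, "the team scored a late goal"))
print(classify_knn(TRAIN, "driver update for new chip"))
```

A decision tree, the third model in the comparison, would consume the same word-count vectors; it is left out to keep the sketch short.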
