Conference: Intelligent Information Processing and Web Mining; Advances in Soft Computing

Automated Classification of Web Documents into a Hierarchy of Categories


Abstract

This paper investigates the problem of classifying HTML documents into a hierarchy of categories, in the context of a cooperative information repository named WebClassII. The hierarchy of categories is involved in all aspects of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are: a) an experimental study on actual Web documents, which can be associated with any node in the hierarchy; b) the feature selection process; c) the automated selection of thresholds for the score returned by a classifier; d) the comparison of three different techniques (flat, hierarchical with proper training sets, hierarchical with hierarchical training sets); e) the definition of new measures for evaluating system performance. Results show that the use of hierarchical training sets improves the hierarchical techniques.
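To make the idea of threshold-based hierarchical classification concrete, here is a minimal Python sketch of one possible top-down scheme in which a document may stop at any node, internal or leaf. It is not the WebClassII implementation: the `Category` class, the keyword-overlap `score` function, the fixed `threshold` values, and the example categories are all assumptions made for illustration. The abstract says thresholds are selected automatically and features are learned, neither of which is shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node in a hypothetical category hierarchy (illustrative only)."""
    name: str
    keywords: frozenset = frozenset()  # toy stand-in for learned features
    threshold: float = 0.5             # assumed fixed; minimum score to enter this node
    children: list = field(default_factory=list)


def score(category, tokens):
    """Toy scorer: fraction of the category's keywords found in the document."""
    if not category.keywords:
        return 0.0
    return len(category.keywords & tokens) / len(category.keywords)


def classify(root, text):
    """Descend from the root while some child scores at or above its threshold."""
    tokens = set(text.lower().split())
    node = root
    while node.children:
        scored = [(score(child, tokens), child) for child in node.children]
        passing = [(s, child) for s, child in scored if s >= child.threshold]
        if not passing:
            break  # no child passes its threshold: the document stays at this node
        node = max(passing, key=lambda pair: pair[0])[1]
    return node.name


# Hypothetical hierarchy used only to exercise the sketch.
root = Category("root", children=[
    Category("sports", frozenset({"match", "team", "score"}), 0.5, children=[
        Category("football", frozenset({"goal", "penalty"}), 0.5),
        Category("tennis", frozenset({"serve", "racket"}), 0.5),
    ]),
    Category("science", frozenset({"experiment", "theory"}), 0.5),
])

print(classify(root, "the team won the match after a late goal and a penalty"))  # football
print(classify(root, "the team lost the match"))                                 # sports
print(classify(root, "weather today is mild"))                                   # root
```

The second call shows a document assigned to an internal node: it passes the threshold for "sports" but for none of its children. A flat classifier, by contrast, would score every category in a single step and ignore the tree structure.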
