
Web page categorization using hierarchical headings structure


Abstract

The goal of Web page categorization is to classify Web documents into a fixed number of predefined categories. Previous work in this area has relied on large numbers of labeled training documents for supervised learning. The problem is that labeled training documents are difficult to create: unlabeled documents are easy to collect, but categorizing them manually to build a training set is not. A new machine learning algorithm is therefore needed to overcome this difficulty. We propose a new algorithm called Iterative Cross-Training (ICT). The paper also presents a new feature set, the hierarchical structure of the headings appearing in a Web page, to enhance classification performance. We found that the hierarchical structure of headings has a high impact on classification performance and can improve it.
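As an illustration of what a heading-based feature set might look like, the sketch below pulls the `h1` to `h6` headings out of an HTML page and turns their words into weighted term features. The abstract does not say how the heading hierarchy is encoded, so the names `HeadingExtractor` and `heading_features` and the `1 / level` weighting are assumptions made for this example, not the paper's method. The ICT algorithm itself is not sketched because the abstract gives no details of it.

```python
from collections import Counter
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingExtractor(HTMLParser):
    """Collects (level, text) pairs for every h1..h6 element in a page."""

    def __init__(self):
        super().__init__()
        self.headings = []   # (level, text) in document order
        self._level = None   # level of the heading currently open, if any
        self._buf = []       # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level == int(tag[1]):
            # Collapse runs of whitespace left by nested inline markup.
            text = " ".join("".join(self._buf).split())
            if text:
                self.headings.append((self._level, text))
            self._level = None


def heading_features(html):
    """Return the heading list and a word -> weight mapping.

    Assumption: words in higher-level headings matter more, so each
    occurrence contributes 1 / level (h1 -> 1.0, h2 -> 0.5, ...).
    """
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    features = Counter()
    for level, text in parser.headings:
        for word in text.lower().split():
            features[word] += 1.0 / level
    return parser.headings, dict(features)


page = "<h1>Sports News</h1><p>body</p><h2>Football <b>Scores</b></h2>"
headings, features = heading_features(page)
print(headings)
print(features)
```

The resulting dictionary could be merged with ordinary body-text features before being passed to whichever classifier is used. A different weighting, or features that encode the parent heading of each subheading, would be equally plausible readings of "hierarchical structure".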
