Web page categorization using hierarchical headings structure


Abstract

The goal of Web page categorization is to classify Web documents into a fixed number of predefined categories. Previous work in this area relied on a large number of labeled training documents for supervised learning. The problem is that labeled training documents are difficult to create: unlabeled documents are easy to collect, but categorizing them manually to build a training set is not. A new machine learning algorithm is therefore needed to overcome these difficulties. We propose a new algorithm called Iterative Cross-Training (ICT). The paper also presents a new feature set, the hierarchical structure of the headings appearing in a Web page, to enhance classification performance. We found that the hierarchical structure of headings has a high impact and enhances classification performance.
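
The abstract does not spell out how the heading hierarchy is turned into features, so the following is only a minimal sketch of one plausible reading: collect the `h1` to `h6` headings of a page and count their words, giving more weight to higher-level headings. The names `HeadingCollector` and `heading_features` and the weighting `7 - level` are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements in document order."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished (level, text) pairs
        self._level = None   # level of the heading currently being read
        self._chunks = []    # text fragments of that heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside a heading element.
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._level is not None:
            text = " ".join("".join(self._chunks).split())
            self.headings.append((self._level, text))
            self._level = None


def heading_features(html):
    """Return a Counter of lowercase heading words, weighted by level.

    Assumed weighting (hypothetical): a word in <h1> counts 6,
    a word in <h6> counts 1.
    """
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    features = Counter()
    for level, text in parser.headings:
        for word in text.lower().split():
            features[word] += 7 - level
    return features
```

A feature vector built this way could be fed to any text classifier alongside the usual body-text features; whether the paper combines them in this manner is not stated in the abstract.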

Bibliographic record

Source: Information Technology Interfaces, 2002. ITI 2002. Proceedings of the 24th International Conference on, 2002, pp. 37-42 (6 pages)
Authors: Soonthornphisaj N.; Chartbanchachai P.; Pratheeptham T.; Kijsirikul B.
Original format: PDF
Chinese Library Classification: Radio electronics, telecommunications technology
Keywords: learning (artificial intelligence); classification; Internet; Web page categorization; Web documents; labeled training documents; supervised learning; machine learning algorithm; iterative cross-training; World Wide Web; hierarchical structure; featu;


Similar literature

1. Use of Medical Subject Headings (MeSH) in Portuguese for categorizing web-based healthcare content [J]. Mancini F, Sousa FS, Teixeira FO, Journal of Biomedical Informatics. 2011, No. 2
2. A study of the effects of spatial contiguity and hierarchically structured headings in a shipboard operating and maintenance manual [J]. Paul Edward Doherty. WMU Journal of Maritime Affairs. 2016, No. 1
3. Hierarchical structures made of proteins. The complex architecture of spider webs and their constituent silk proteins [J]. Markus Heim, Lin Romertt, Thomas Scheibel. Chemical Society Reviews. 2010, No. 1
4. Web page categorization using hierarchical headings structure [C]. Nuanwan Soonthornphisaj, Pisit Chartbanchachai, Thanapol Pratheeptham, International Conference on Information Technology Interfaces. 2002
5. Combining machine learning and hierarchical structures for text categorization [D]. Ruiz Ruiz, Miguel Enrique. 2001
6. Stimulus Type Level of Categorization and Spatial-Frequencies Utilization: Implications for Perceptual Categorization Hierarchies [O]. Assaf Harel, Shlomo Bentin
7. Use of Medical Subject Headings (MeSH) in Portuguese for categorizing web-based healthcare content [O]. Mancini Felipe, Sousa Fernando Sequeira, Teixeira Fabio Oliveira, 2011
