International Conference on Enterprise Information Systems (ICEIS 2001), Jul 7-10, 2001, Setubal, Portugal

THE EFFECTS OF DIFFERENT FEATURE SETS ON THE WEB PAGE CATEGORIZATION PROBLEM USING THE ITERATIVE CROSS-TRAINING ALGORITHM

Abstract

This paper examines the effects of different feature sets on the Web page categorization problem. The features are words appearing in the content of a Web page, words appearing on the hyperlinks that link to the page, and words appearing in the headings of the page. The experiments are conducted using a new algorithm called the Iterative Cross-Training algorithm (ICT), which was successfully applied to Thai Web page identification. The main concept of ICT is to iteratively train two sub-classifiers on unlabeled examples in a crossing manner. We compare ICT against a supervised naive Bayes classifier and a Co-Training classifier. The experimental results show that ICT obtains the highest performance and that the heading feature is considerably successful in helping classifiers build the correct model for the Web page categorization task.
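The cross-training idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the sub-classifiers are built or how the loop is initialized, so the code below assumes that both sub-classifiers are small multinomial naive Bayes models, that each one sees a different feature set (content words versus heading words), and that a rough set of seed labels (hypothetical, for example from a simple heuristic) starts the iteration. All function names and the toy data are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model from word lists and labels."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_nb(model, words):
    """Return the most probable label for one document (a list of words)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        # Add-one smoothing; the extra +1 reserves a slot for unseen words.
        denom = total_words + len(vocab) + 1
        score = math.log(class_counts[label] / total_docs)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def iterative_cross_training(view_a, view_b, seed_labels, max_rounds=10):
    """Assumed cross-training loop: each sub-classifier is trained on the
    labels most recently produced by the other one, until labels stabilize."""
    labels = list(seed_labels)
    for _ in range(max_rounds):
        model_a = train_nb(view_a, labels)            # A learns from B's labels
        labels_from_a = [predict_nb(model_a, d) for d in view_a]
        model_b = train_nb(view_b, labels_from_a)     # B learns from A's labels
        labels_from_b = [predict_nb(model_b, d) for d in view_b]
        if labels_from_b == labels:                   # no change: stop early
            break
        labels = labels_from_b
    return model_a, model_b, labels


# Toy data (hypothetical): one word list per page and per feature set.
content_view = [
    ["football", "match", "goal"],
    ["goal", "team", "league"],
    ["stock", "market", "shares"],
    ["market", "bank", "shares"],
]
heading_view = [
    ["sports", "news"],
    ["sports", "scores"],
    ["finance", "news"],
    ["finance", "report"],
]
seed = ["sport", "sport", "finance", "finance"]

model_content, model_heading, final_labels = iterative_cross_training(
    content_view, heading_view, seed
)
```

After the loop, either sub-classifier can label a new page from its own feature set, for example `predict_nb(model_heading, ["sports"])`. With noisy seed labels this simple loop may just reproduce the noise, so it should be read as an illustration of the control flow only.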