Journal: 计算机应用 (Journal of Computer Applications)

Cascaded and Low-Consuming Online Method for Large-Scale Web Page Category Acquisition

Abstract

To balance accuracy against resource cost when building an automatic system for collecting massive numbers of well-classified Web pages, a cascaded, low-consumption online method for large-scale Web page category acquisition was proposed. The method uses a cascade strategy to combine online and offline Web page classifiers and exploit the advantages of each. An online classifier trained on features of the page titles contained in anchor text served as the first-level classifier, and the confidence of its result was measured by the information entropy of the posterior probability distribution. The second-level classifier was triggered when this entropy-based confidence measure exceeded a predefined threshold obtained by Multi-Objective Particle Swarm Optimization (MOPSO). The second-level classifier extracted features from the body text of the downloaded Web page and classified it offline with a classifier pre-trained on Web page body features. In comparison experiments against online-only and offline-only classification, the proposed method increased the F1 measure by 10.85% and 4.57% respectively. Its efficiency fell by less than 30% relative to online-only classification and improved by about 70% relative to offline-only classification. The results indicate that the proposed method not only classifies more accurately but also significantly reduces computing overhead and bandwidth consumption.
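The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two classifier callables, and the fixed threshold value are hypothetical, and the threshold here is simply passed in rather than obtained by MOPSO. The sketch assumes the quantity compared against the threshold is the Shannon entropy of the first-level posterior, so a flat (uncertain) posterior sends the page to the second-level classifier.

```python
import math
from typing import Callable, Sequence, Tuple


def posterior_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a posterior distribution (0 * log 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def cascade_classify(
    anchor_text: str,
    url: str,
    online_posterior: Callable[[str], Sequence[float]],  # hypothetical first-level model
    offline_classify: Callable[[str], int],               # hypothetical second-level model
    threshold: float,                                      # assumed given; MOPSO-tuned in the paper
) -> Tuple[int, str]:
    """Return (class index, which level produced it)."""
    probs = online_posterior(anchor_text)
    if posterior_entropy(probs) > threshold:
        # First-level result is too uncertain: download the page and classify offline.
        return offline_classify(url), "offline"
    # First-level result is accepted without downloading the page.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, "online"


if __name__ == "__main__":
    # Stub classifiers standing in for trained models (illustrative only).
    def stub_online(text: str) -> Sequence[float]:
        return [0.9, 0.05, 0.05] if "sports" in text else [0.34, 0.33, 0.33]

    def stub_offline(url: str) -> int:
        return 2

    print(cascade_classify("sports news", "http://example.com/a", stub_online, stub_offline, 1.0))
    print(cascade_classify("click here", "http://example.com/b", stub_online, stub_offline, 1.0))
```

Only pages whose anchor text yields an uncertain posterior incur the download and full-text classification cost, which is the source of the bandwidth and computation savings the abstract reports relative to offline-only classification.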
