Journal: International Journal of Metaheuristics

Enhanced and combined centroid-based approach for multi-label genre classification of web pages


Abstract

This paper proposes an enhanced and combined centroid-based approach to classifying web pages by genre. To deal with the complexity of web pages, the approach implements a multi-label classification scheme in which a web page can be assigned to more than one genre. It also implements incremental classification to handle the rapid evolution of web genres: web pages are classified one by one and, depending on the similarity between the new page and each genre centroid, the approach either adjusts the genre centroid or treats the new page as a noise page and discards it. Moreover, the approach combines three homogeneous centroid-based classifiers: contextual, logical and hyperlink classifiers. These classifiers exploit character n-grams extracted from four sources: URL, title, headings and anchors. Experiments conducted on a known multi-label corpus show that the approach is very fast and outperforms many other multi-label classifiers.
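
The incremental, multi-label step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, the acceptance threshold or the centroid update rule, so cosine similarity, a fixed `threshold` and a running sum of n-gram counts are all assumptions here, and the names `char_ngrams`, `GenreCentroid` and `classify_page` are hypothetical. The sketch models a single classifier over one text source; the combination of the contextual, logical and hyperlink classifiers is not shown.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class GenreCentroid:
    """Running sum of the n-gram vectors of the pages accepted for one genre.

    Cosine similarity ignores vector length, so the sum points in the same
    direction as the mean and can stand in for the centroid.
    """

    def __init__(self, seed_texts, n=3):
        self.total = Counter()
        self.count = 0
        for text in seed_texts:
            self.add(char_ngrams(text, n))

    def add(self, vec):
        """Adjust the centroid with one more accepted page."""
        self.total.update(vec)
        self.count += 1

    def similarity(self, vec):
        return cosine(vec, self.total)


def classify_page(text, centroids, threshold=0.2, n=3):
    """Return every genre whose centroid is similar enough to the page.

    Accepted pages adjust the matching centroids. An empty result means the
    page matched no genre and is discarded as noise (centroids unchanged).
    The threshold value is an illustrative assumption.
    """
    vec = char_ngrams(text, n)
    # Score against all centroids first, then update, so that one update
    # cannot influence the decision for another genre on the same page.
    labels = [g for g, c in centroids.items() if c.similarity(vec) >= threshold]
    for genre in labels:
        centroids[genre].add(vec)
    return labels


if __name__ == "__main__":
    centroids = {
        "shop": GenreCentroid(["buy cheap shoes online shop",
                               "shop cart checkout price"]),
        "blog": GenreCentroid(["my blog post about travel",
                               "blog diary post comments"]),
    }
    print(classify_page("online shop price checkout", centroids))
    print(classify_page("zzzz qqqq xxxx", centroids))
```

Because each genre is tested independently against the threshold, a page can receive several labels, one label or none, which is one simple way to obtain the multi-label behaviour the abstract describes.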
