Conference: Semantics, Knowledge and Grid, 2009. SKG 2009

Classifier-Guided Topical Crawler: A Novel Method of Automatically Labeling the Positive URLs

Abstract

Obtaining labeled training samples is a key factor in building a classifier-guided topical crawler. Many such classifiers are trained on web pages that are labeled manually or extracted from the Open Directory Project (ODP), and they then judge the topical relevance of the web pages pointed to by hyperlinks in the crawler frontier. Although labeled web pages are comparatively easy to obtain, training on them violates the standard machine-learning assumption that training and test sets are i.i.d. (independent and identically distributed), because the instances being classified are hyperlinks (URLs) rather than web pages. For this reason, this paper proposes a novel template-based method for automatically labeling positive URLs in order to develop classifier-guided topical crawlers. Extensive off-line and on-line experiments are performed. The results demonstrate that a classifier-guided topical crawler trained with labeled URLs achieves higher recall than one trained with labeled web pages. The results also show that a classifier using the immediate vicinity of hyperlinks together with the corresponding anchor texts leads the crawler to attain a harvest rate of about 95%.
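To make the idea of a hyperlink as a classification instance more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the "immediate vicinity" of a link can be approximated by a fixed window of words on each side of the anchor, and that harvest rate means the fraction of fetched pages judged relevant, which is the usual definition in the focused-crawling literature; the abstract specifies neither. The names `LinkContextParser`, `link_instances`, `harvest_rate`, and the `window` parameter are hypothetical, and the template-based labeling step itself is not shown because the abstract does not describe it.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect href, anchor text, and nearby words for each <a href=...> link."""

    def __init__(self, window=5):
        super().__init__()
        self.window = window  # words of context kept on each side (assumed value)
        self.words = []       # all text words seen so far, in document order
        self.links = []       # one dict per link: href plus word offsets of the anchor
        self._open = None     # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self._open = {"href": href, "start": len(self.words)}

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self._open["end"] = len(self.words)
            self.links.append(self._open)
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())

    def instances(self):
        """Build one feature record per link once the whole page has been parsed."""
        out = []
        for link in self.links:
            s, e = link["start"], link["end"]
            before = self.words[max(0, s - self.window):s]
            after = self.words[e:e + self.window]
            out.append({
                "href": link["href"],
                "anchor": " ".join(self.words[s:e]),
                "vicinity": " ".join(before + after),
            })
        return out


def link_instances(html, window=5):
    """Return the link-level instances found in an HTML string."""
    parser = LinkContextParser(window)
    parser.feed(html)
    parser.close()
    return parser.instances()


def harvest_rate(relevant_flags):
    """Fraction of fetched pages judged relevant (0.0 if nothing was fetched)."""
    flags = list(relevant_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

In a crawler built along these lines, the `anchor` and `vicinity` strings would be turned into features for a URL-level classifier, so that training and test instances come from the same kind of object. This sketch stops short of the classifier itself.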
