
An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation



Abstract

A focused crawler is topic-specific: it selectively collects web pages from the Internet that are relevant to a given topic. However, the performance of current focused crawlers is easily degraded by the environment of web pages and by pages that cover multiple topics. During crawling, a highly relevant region may be ignored because the page containing it has low overall relevance, and anchor text or link context may mislead the crawler. To address these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term weighting approach (ITFIDF) in order to identify highly relevant web pages. Second, we introduce a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance of the URLs on a web page to the given topic. The experimental results show that the classifier using ITFIDF outperforms the one using TFIDF, and that our focused crawler achieves a higher harvest rate and target recall than focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition strategies. We conclude that the proposed methods are effective for focused crawling.
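To make the crawling terminology concrete, the following is a minimal sketch of a generic best-first URL frontier together with the usual harvest-rate metric (relevant pages divided by pages crawled). It is not the paper's LPE or ITFIDF method, whose formulas are not given in the abstract. The names `Frontier`, `push`, `pop`, and `harvest_rate` are hypothetical, and the priority scores are assumed to come from some external link-scoring function.

```python
import heapq
from itertools import count


class Frontier:
    """Best-first URL frontier: the URL with the highest score is popped first.

    Hypothetical illustration only; scores are assumed to be supplied by
    an external link-priority function.
    """

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # insertion order breaks ties between equal scores

    def push(self, url, score):
        """Queue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, next(self._tie), url))
        return True

    def pop(self):
        """Return (url, score) for the highest-priority queued URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def harvest_rate(relevant_pages, crawled_pages):
    """Fraction of crawled pages judged relevant (0.0 if nothing was crawled)."""
    return relevant_pages / crawled_pages if crawled_pages else 0.0
```

In a crawl loop, each fetched page would be classified, its outgoing links scored and pushed onto the frontier, and the next URL taken with `pop()`. A breadth-first crawler corresponds to giving every link the same score.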
