Focused Crawling Using Navigational Rank

Abstract

The goal of focused crawling is to use limited resources to discover web pages related to a specific topic, rather than downloading every accessible web document. The major challenge in focused crawling is determining how likely each hyperlink is to lead to target pages. To compute this capability, we present a novel approach called Navigational Rank (NR). NR is a two-step, two-direction credit propagation approach. It has three main advantages over existing methods. First, NR is updated dynamically during the crawling process, so it adapts well to different website structures. Second, NR shows a significant performance advantage when the crawling seed is far from the target pages and the target pages make up only a small portion of the whole website. Third, NR computes each link's capability of leading to target pages by considering both the target and the non-target pages it leads to. This global knowledge yields better performance than using target pages alone. We evaluated the proposed approach in extensive experiments on two groups of large-scale, real-world datasets from two different domains. The experimental results show that our approach is domain-independent and significantly outperforms state-of-the-art methods.
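
The abstract does not give NR's formulas, so the following is not the NR algorithm. It is only a minimal sketch of two ideas the abstract names: scoring links from both the target and non-target pages they have led to, and updating those scores while the crawl is running. Everything in it is an assumption made for illustration: the toy `SITE` graph, the `TARGETS` set, the grouping of links by a hypothetical "label", the smoothed-ratio score, and the `crawl` function itself. The two-step, two-direction credit propagation is not attempted.

```python
import heapq
from collections import defaultdict

# Hypothetical toy site: page -> list of (link label, destination page).
SITE = {
    "home": [("nav", "news"), ("nav", "catalog")],
    "news": [("story", "n1"), ("story", "n2"), ("story", "n3")],
    "catalog": [("item", "p1"), ("item", "p2"), ("item", "p3")],
    "n1": [], "n2": [], "n3": [],
    "p1": [], "p2": [], "p3": [],
}
# Assumed to be known, e.g. from some topic classifier (not shown).
TARGETS = {"p1", "p2", "p3"}


def crawl(seed, budget):
    """Best-first crawl over SITE; returns pages in the order fetched."""
    hits = defaultdict(int)    # per label: fetched pages that were targets
    misses = defaultdict(int)  # per label: fetched pages that were not

    def score(label):
        # Smoothed fraction of targets reached through links with this label.
        return (hits[label] + 1) / (hits[label] + misses[label] + 2)

    # Heap entries: (negated score, insertion counter, page, label).
    frontier = [(-score(None), 0, seed, None)]
    seen = {seed}
    order = []
    counter = 1
    while frontier and len(order) < budget:
        _, _, page, label = heapq.heappop(frontier)
        order.append(page)
        if label is not None:
            # Both outcomes feed the score, not only the successes.
            if page in TARGETS:
                hits[label] += 1
            else:
                misses[label] += 1
        for out_label, dest in SITE[page]:
            if dest not in seen:
                seen.add(dest)
                frontier.append((0.0, counter, dest, out_label))
                counter += 1
        # Dynamic update: re-score every queued link with the new counts.
        frontier = [(-score(l), c, p, l) for _, c, p, l in frontier]
        heapq.heapify(frontier)
    return order


print(crawl("home", budget=7))
```

In this toy run, a single non-target "story" page is enough to lower the priority of the remaining "story" links, so the crawler moves on to the catalog and reaches all three target pages within the budget.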
