Journal: New Generation Computing

Efficient Topical Focused Crawling Through Neighborhood Feature

Abstract

A focused web crawler is an essential tool for gathering the domain-specific data used by national web corpora, vertical search engines, and similar applications, since it is more efficient than general Breadth-First or Depth-First crawlers. The central problem in focused crawling is prioritizing the unvisited web pages in the crawl frontier so that they can be crawled in order of priority. The most common feature used in focused crawling studies to prioritize an unvisited web page is the relevancy of its source web pages, i.e., the already-crawled pages that link to it. However, this feature is limited, because the relevancy of the unvisited web page cannot be estimated correctly when only a few source web pages are available. To solve this problem and enhance the efficiency of focused web crawlers, we propose a new feature, called the "neighborhood feature". It allows additional already-downloaded web pages to be used when estimating the priority of a target web page. These additional pages consist of web pages located in the same directory as the target web page and web pages whose directory paths are similar to that of the target web page. Our experimental results show that our enhanced focused crawlers outperform both the crawlers that do not use the neighborhood feature and state-of-the-art focused crawlers, including the HMM crawler.
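The idea of the neighborhood feature can be illustrated with a minimal Python sketch. The abstract does not say how directory-path similarity is measured or how the neighborhood feature is combined with the source-page feature, so everything below is an assumption made for illustration: the shared-prefix similarity, the `min_similarity` threshold, the `alpha` weight, and all function names are hypothetical and are not taken from the paper.

```python
import posixpath
from urllib.parse import urlsplit


def directory_of(url):
    """Return (host, list of directory segments) for a URL."""
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path)
    segments = [s for s in directory.split("/") if s]
    return parts.netloc.lower(), segments


def path_similarity(url_a, url_b):
    """Assumed similarity: shared leading directory segments divided by
    the longer directory depth. Pages on different hosts score 0.0;
    pages in the same directory score 1.0."""
    host_a, segs_a = directory_of(url_a)
    host_b, segs_b = directory_of(url_b)
    if host_a != host_b:
        return 0.0
    if not segs_a and not segs_b:
        return 1.0  # both pages sit in the site root
    shared = 0
    for a, b in zip(segs_a, segs_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(segs_a), len(segs_b))


def neighborhood_score(target_url, downloaded, min_similarity=0.5):
    """Similarity-weighted mean relevance of already-downloaded neighbors.

    `downloaded` maps URL -> relevance in [0, 1] (hypothetical input).
    Returns None when no downloaded page is similar enough."""
    weighted = 0.0
    total = 0.0
    for url, relevance in downloaded.items():
        sim = path_similarity(target_url, url)
        if sim > 0.0 and sim >= min_similarity:
            weighted += sim * relevance
            total += sim
    return weighted / total if total else None


def priority(target_url, source_relevances, downloaded, alpha=0.5):
    """Blend the source-page feature with the neighborhood feature.

    Falls back to whichever feature is available; `alpha` is an
    assumed mixing weight, not a value from the paper."""
    neighbor = neighborhood_score(target_url, downloaded)
    if not source_relevances:
        return neighbor if neighbor is not None else 0.0
    source = sum(source_relevances) / len(source_relevances)
    if neighbor is None:
        return source
    return alpha * source + (1 - alpha) * neighbor
```

In this sketch, a target page with no known source pages still receives a priority as long as some page from the same or a similar directory has already been downloaded, which is the situation the abstract describes as the weakness of the source-page feature.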
