Journal: Data & Knowledge Engineering

Improving the performance of focused web crawlers

Abstract

This work addresses issues in the design and implementation of focused crawlers. It proposes several variants of state-of-the-art crawlers that rely on web page content and link information to estimate the relevance of web pages to a given topic. Particular emphasis is given to crawlers that learn not only the content of relevant pages (as classic crawlers do) but also the paths leading to relevant pages. A novel learning crawler, inspired by a previously proposed Hidden Markov Model (HMM) crawler, is also described. All crawlers are built on the same baseline implementation (only the priority assignment function differs from crawler to crawler), which provides an unbiased framework for comparing their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used to assign download priorities to web pages. Furthermore, the new HMM crawler improves on the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.
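
The idea of a shared baseline in which only the priority assignment function differs can be illustrated with a small sketch. The code below is not the authors' implementation: the names `Frontier`, `combined_priority`, and `weight`, the bag-of-words cosine similarity, and the use of the linking (parent) page's text as the "web page content" signal are all assumptions made for illustration. It shows a best-first URL queue whose ordering comes from a pluggable function that mixes page-content similarity with anchor-text similarity.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term-frequency vector (illustrative only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_priority(topic, parent_text, anchor_text, weight=0.5):
    """Hypothetical priority: weighted mix of page and anchor similarity.

    `weight` is an assumed tuning parameter, not a value from the paper.
    """
    topic_vec = term_vector(topic)
    page_score = cosine(topic_vec, term_vector(parent_text))
    anchor_score = cosine(topic_vec, term_vector(anchor_text))
    return weight * page_score + (1.0 - weight) * anchor_score


class Frontier:
    """Best-first URL queue; swapping `priority_fn` yields a different crawler."""

    def __init__(self, priority_fn):
        self.priority_fn = priority_fn
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def add(self, url, topic, parent_text, anchor_text):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        score = self.priority_fn(topic, parent_text, anchor_text)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score


if __name__ == "__main__":
    frontier = Frontier(combined_priority)
    topic = "focused web crawler"
    frontier.add("http://example.org/a", topic, "recipes for dinner", "click here")
    frontier.add("http://example.org/b", topic,
                 "a survey of web crawler design", "focused crawler")
    print(frontier.pop()[0])  # the link with on-topic page and anchor text
```

Under these assumptions, a content-only or anchor-only crawler would be obtained by passing a different function to `Frontier`, which mirrors the evaluation setup described in the abstract without reproducing its actual scoring models.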
