Journal: Journal of Computational Methods in Sciences and Engineering

UCrawler: A learning-based web crawler using a URL knowledge base

Abstract

Focused crawlers, a fundamental component of vertical search engines, crawl only the web pages related to a specific topic. Existing focused crawlers commonly suffer from low crawling efficiency and topic drift. In this paper, we propose a learning-based focused crawler that uses a URL knowledge base. To improve the accuracy of the similarity estimate, topic similarity is measured using the parent page content, the anchor information, and the URL content. The URL content is also learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism that combines content analysis with a simple link-analysis strategy, which decreases computational complexity and avoids the locality problem of crawling. Experimental results show that the proposed algorithm achieves better precision than traditional methods, including the shark-search and best-first search algorithms, and avoids the local optimum problem of crawling.
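The abstract does not give the scoring formula, so the following is only a minimal sketch of how the three signals it names (parent page content, anchor information, and URL content) might be combined into a single link priority. The cosine similarity over term counts, the weights, and the names `UrlKnowledgeBase`, `link_priority`, and `Frontier` are all assumptions made for illustration, not the paper's method.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class UrlKnowledgeBase:
    """Hypothetical store of URL tokens observed on relevant pages."""

    def __init__(self, seed_terms):
        self.terms = Counter(seed_terms)

    def score(self, url):
        """Similarity between a URL's tokens and the learned tokens."""
        return cosine(Counter(tokenize(url)), self.terms)

    def update(self, url, relevant):
        """Learn from a crawled URL once its page is judged relevant."""
        if relevant:
            self.terms.update(tokenize(url))


def link_priority(topic, parent_text, anchor_text, url, kb,
                  weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three similarity signals (weights are assumed)."""
    topic_vec = Counter(tokenize(topic))
    w_parent, w_anchor, w_url = weights
    return (w_parent * cosine(Counter(tokenize(parent_text)), topic_vec)
            + w_anchor * cosine(Counter(tokenize(anchor_text)), topic_vec)
            + w_url * kb.score(url))


class Frontier:
    """Priority queue of unvisited URLs, highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority
```

In this sketch, a crawl loop would pop the best URL from the `Frontier`, fetch and classify the page, call `kb.update` with the result, and push the page's outgoing links with their `link_priority` scores. Updating the knowledge base after each page is one way to read "learned and updated iteratively"; the paper may do this differently.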