Published in: 计算机工程与应用 (Computer Engineering and Applications)

A Focused Crawling Method Based on an Expected Residual Energy Model

     

Abstract

Determining the search direction and depth is the key problem in focused crawling. To address it, this paper proposes the concept of the expected residual energy of a hyperlink, together with a method for computing it, and uses that value as the URL priority. The method uses the information in the current web page to calculate the immediate return energy of each hyperlink, then iteratively updates the link's expected residual energy using the historical return knowledge contributed by the different historical paths that reach the same link. With the expected residual energy serving as both the link priority and the search depth limit, the paper presents the system architecture and crawling algorithm of a focused crawler based on the expected residual energy model, and gives the implementation of its key modules. Experimental results show that the method has a stronger ability to discover topic-relevant websites.
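The abstract does not give the formulas, so the following is only a minimal sketch of the general idea, not the paper's algorithm. Everything in it is an assumption made for illustration: the keyword-overlap score in `immediate_energy`, the blending rule in `update_energy`, the constants `ALPHA`, `DECAY`, and `MIN_ENERGY`, and the in-memory `graph` that stands in for fetching and parsing pages. It shows how a single energy value per link can act both as the frontier priority and as a depth limit, and how evidence from several paths to the same link can be merged iteratively.

```python
import heapq

# All constants are illustrative assumptions, not values from the paper.
ALPHA = 0.5       # weight given to evidence from a newly found path
DECAY = 0.8       # share of the parent's energy carried over one hop
MIN_ENERGY = 0.1  # links below this energy are not followed (depth limit)


def immediate_energy(anchor_text, topic_terms):
    """Hypothetical relevance score: fraction of topic terms in the anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def update_energy(energy, url, parent_energy, immediate):
    """Merge the evidence from one more path into the stored estimate for url."""
    sample = DECAY * parent_energy + (1 - DECAY) * immediate
    if url in energy:
        energy[url] = (1 - ALPHA) * energy[url] + ALPHA * sample
    else:
        energy[url] = sample
    return energy[url]


def crawl(seed, graph, topic_terms, max_pages=100):
    """Best-first crawl over graph: url -> list of (target_url, anchor_text)."""
    energy = {seed: 1.0}
    frontier = [(-1.0, seed)]  # max-heap emulated with negated energy
    visited, seen = [], set()
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in seen:
            continue  # stale heap entry left over from an earlier estimate
        seen.add(url)
        visited.append(url)
        for target, anchor in graph.get(url, []):
            if target in seen:
                continue
            score = immediate_energy(anchor, topic_terms)
            e = update_energy(energy, target, energy[url], score)
            if e >= MIN_ENERGY:
                heapq.heappush(frontier, (-e, target))
    return visited, energy
```

In this sketch a chain of off-topic links loses a fixed share of energy at every hop and is eventually dropped, while an on-topic link restores part of it, so the crawler can pass through a few irrelevant pages without wandering indefinitely. A link reached a second time through a more promising path has its estimate raised and is re-queued with the new priority.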
