International Conference on Information Networking

History-enhanced focused website segment crawler

Abstract

The primary challenge in focused crawling research is how to use computing resources (e.g., bandwidth, disk space, and time) efficiently to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that aims to crawl a group of relevant web pages located in the same directory path, called a website segment, and it has achieved high efficiency so far. One limitation of our previous approach is that it may repeatedly visit a website that does not serve any relevant website segments when that website's segments share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a “history-enhanced focused website segment crawler” to solve this problem. The underlying idea is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from its website. To implement this idea, we propose a new prediction feature, called the “history feature”, that is extracted from recent crawling results, i.e., the relevant and irrelevant web pages gathered from the target website. Our experiment shows that the newly proposed feature can improve the crawling efficiency of our focused crawler by up to approximately 5%.
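
The idea of lowering a segment's priority after a run of irrelevant downloads from its website can be illustrated with a minimal Python sketch. The abstract does not define the history feature precisely, so everything here is an assumption made for illustration: the `HistoryTracker` class, the sliding window of 20 pages, the neutral value 0.5 for unseen websites, and the multiplicative `penalty` are hypothetical. In the paper the feature is an input to a learned predictor, not a hand-tuned multiplier.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HistoryTracker:
    """Hypothetical helper: remembers recent relevance results per website."""

    def __init__(self, window=20):
        # Assumption: only the last `window` downloads per website matter.
        self._recent = defaultdict(lambda: deque(maxlen=window))

    @staticmethod
    def site_of(url):
        # A website is identified here by its host name (an assumption).
        return urlsplit(url).netloc.lower()

    def record(self, url, relevant):
        """Store the relevance judgement of a page that was just downloaded."""
        self._recent[self.site_of(url)].append(bool(relevant))

    def history_feature(self, url):
        """Fraction of recent pages from this website that were relevant.

        Returns 0.5 (an assumed neutral value) when nothing has been seen.
        """
        recent = self._recent.get(self.site_of(url))
        if not recent:
            return 0.5
        return sum(recent) / len(recent)

    def irrelevant_streak(self, url):
        """Number of consecutive irrelevant pages most recently downloaded."""
        streak = 0
        for relevant in reversed(self._recent.get(self.site_of(url), ())):
            if relevant:
                break
            streak += 1
        return streak


def adjusted_priority(base_score, tracker, url, penalty=0.1):
    """Scale a segment's base score down once per consecutive irrelevant page.

    The multiplicative form and the `penalty` value are illustrative only.
    """
    return base_score * (1 - penalty) ** tracker.irrelevant_streak(url)
```

A crawler loop would call `record` after classifying each downloaded page and `adjusted_priority` (or pass `history_feature` to its predictor) when scoring unvisited segments of the same website, so that a site yielding only irrelevant pages sinks in the queue.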