Journal: Computers & Mathematics with Applications

A Cross-language Focused Crawling Algorithm Based On Multiple Relevance Prediction Strategies

Abstract

Focused crawling is increasingly seen as a way to address the scalability limitations of existing general-purpose search engines: the crawler traverses the Web and gathers only pages that are relevant to a specific topic. A key issue in the design of focused crawlers is how to predict the relevance, to a given topic, of the unvisited pages pointed to by candidate URLs in the crawling frontier. In this paper, we propose a novel approach based on multiple relevance prediction strategies to address this problem. For cross-language crawling, we first introduce a hierarchical taxonomy to describe topics in both English and Chinese. We then present a formal description of the relevance prediction process and discuss four strategies that use the page contents, anchor texts, URL addresses and link types of Web pages, respectively, to evaluate relevance more accurately; among these, we propose a particular strategy that uses Chinese URL addresses to estimate the relevance of cross-language Web pages. Finally, we combine these strategies with Shark-Search, a classic focused crawling algorithm, to obtain a new focused crawling algorithm, FCMRPS (Focused Crawling based on Multiple Relevance Prediction Strategies). Experiments show that FCMRPS is more effective than the traditional algorithms Breadth-First, Best-First and Shark-Search in terms of precision and sum of information.
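
The abstract does not give the formulas, so the following is only a minimal sketch of the general idea: score each candidate URL in the frontier from four signals (parent page content, anchor text, URL address, link type) and visit the highest-scoring URL first. Everything concrete here is an assumption made for illustration: the bilingual keyword set `TOPIC_TERMS`, the weights in `WEIGHTS`, the keyword-overlap measure, the same-site rule for link type, and the in-memory `TOY_WEB`. It uses a plain weighted sum with a best-first queue, not the Shark-Search score inheritance or the actual FCMRPS combination rule.

```python
import heapq
from urllib.parse import unquote, urlsplit

# Hypothetical bilingual topic description (English and Chinese keywords).
TOPIC_TERMS = {"crawler", "search", "爬虫", "搜索"}

# Hypothetical weights for the four strategies; not taken from the paper.
WEIGHTS = {"content": 0.4, "anchor": 0.3, "url": 0.2, "link_type": 0.1}


def term_overlap(text):
    """Toy relevance measure: fraction of topic terms that occur in the text."""
    lowered = text.lower()
    return sum(1 for term in TOPIC_TERMS if term in lowered) / len(TOPIC_TERMS)


def predict_relevance(parent_text, anchor_text, url, same_site):
    """Weighted sum of four per-strategy scores for one unvisited URL."""
    scores = {
        "content": term_overlap(parent_text),
        "anchor": term_overlap(anchor_text),
        # Decode percent-encoded characters so Chinese terms in the URL can match.
        "url": term_overlap(unquote(url)),
        # Assumed link-type rule: links within the same site are trusted more.
        "link_type": 1.0 if same_site else 0.5,
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


class Frontier:
    """Priority queue of candidate URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def push(self, url, score):
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def crawl(seed, pages, limit=10):
    """Best-first traversal of an in-memory page graph (no network access)."""
    frontier = Frontier()
    frontier.push(seed, 1.0)
    visited = []
    while len(frontier) and len(visited) < limit:
        url, _ = frontier.pop()
        if url not in pages:
            continue
        visited.append(url)
        text, links = pages[url]
        for anchor, target in links:
            same_site = urlsplit(target).netloc == urlsplit(url).netloc
            frontier.push(target, predict_relevance(text, anchor, target, same_site))
    return visited


# Toy data: url -> (page text, [(anchor text, target url), ...]).
TOY_WEB = {
    "http://a.example/": (
        "search crawler portal",
        [("crawler news", "http://a.example/crawler"),
         ("cooking", "http://b.example/food")],
    ),
    "http://a.example/crawler": ("crawler 爬虫", []),
    "http://b.example/food": ("recipes", []),
}

if __name__ == "__main__":
    print(crawl("http://a.example/", TOY_WEB))
```

In this toy graph the link whose anchor text and URL mention the topic is visited before the off-topic link. A real crawler would replace the keyword overlap with a proper similarity measure over the topic taxonomy and would fetch pages over the network.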