Empirical evaluation of the link and content-based focused Treasure-Crawler

Abstract

Indexing the Web is becoming a laborious task for search engines as the Web grows exponentially in size and distribution. At present, the most effective known approach to this problem is the use of focused crawlers. A focused crawler employs its own distinctive algorithm to detect the pages on the Web that relate to its topic of interest. For this purpose, we proposed a custom method that uses specific HTML elements of a page to predict the topical focus of every page that the current page links to but that has not yet been visited. The pages predicted to be on-topic are then sorted by their relevance to the crawler's main topic before they are actually downloaded. In the Treasure-Crawler, we use a hierarchical structure called the T-Graph as an exemplary guide for assigning an appropriate priority score to each unvisited link. The URLs are later downloaded in order of this priority. This paper presents the implementation, test results and performance evaluation of the Treasure-Crawler system. The Treasure-Crawler is evaluated against specific information retrieval criteria such as recall and precision, both of which reach values close to 50%. Obtaining such an outcome supports the significance of the proposed approach.
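The abstract mentions two mechanisms that a short sketch can make concrete: downloading unvisited links in order of a priority score, and evaluating the crawl by precision and recall. The Python below is only an illustration under stated assumptions. The `Frontier` class and the `precision_recall` function are hypothetical names that do not come from the paper, the scores are taken as given numbers, and the T-Graph scoring itself is not reproduced here.

```python
import heapq
from typing import Iterable, List, Set, Tuple


class Frontier:
    """Hypothetical crawl frontier: unvisited URLs ordered by priority score."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seen: Set[str] = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url: str, score: float) -> None:
        # Ignore URLs that were already queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        # Return the queued URL with the highest priority score.
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Standard set-based precision and recall.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

In a crawl loop, each link extracted from a downloaded page would be pushed with whatever score the relevance predictor assigns, and the next download would be the result of `pop()`. Precision and recall are then computed from the set of downloaded pages and a set of pages judged relevant to the topic.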

Bibliographic details

  • Source
    Computer Standards & Interfaces | 2016, Issue 2 | pp. 54-62 | 9 pages
  • Author affiliations

    Department of Computer Science, The George Washington University, Washington DC, United States;

    Computer Networks and Security Laboratory (LARCES), State University of Ceara (UECE), Fortaleza, Ceara, Brazil, Faculty of Science, Engineering and Computing, Kingston University, United Kingdom;

    Computer Networks and Security Laboratory (LARCES), State University of Ceara (UECE), Fortaleza, Ceara, Brazil;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language: English
  • Keywords

    Focused Web crawler; T-Graph; HTML data; Information retrieval; Search engine;
