An effective approach to enhancing a focused crawler using Google


Abstract

In this paper, we share our experience in augmenting the focused crawler of our vertical search engine, which is designed to work with academic slides. The goal of the focused crawler was to collect Microsoft PowerPoint files from academic institutions. A previous approach based on a general web crawler can fail to collect a sufficient number of files, mainly because of the robots exclusion protocol and missing hyperlinks. As a remedy to these problems, we propose a combinatory approach in which the indexing information maintained by a general web search engine such as Google is used to generate a target URL list through our query generator, which is then complemented by our URL extractor and file downloader. Because Google has already crawled billions of web pages, it is more cost-efficient and potentially more effective to systematically retrieve the desired information from Google than to redo the crawling from scratch ourselves. Our focused crawler, which we call SlideCrawler, has been used for our vertical search engine CourseShare since the fall of 2011. The capability of SlideCrawler was verified on the top 500 universities worldwide, from which it collected about one million files. Further, the study results show that SlideCrawler outperforms Nutch, collecting 3.7 times more slide files.
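
The query generator and URL extractor described in the abstract might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the query syntax, so the sketch assumes Google's `site:` and `filetype:` search operators, and the function names, the `example.edu` domain, and the extension list are all hypothetical. Submitting the queries to a search engine and downloading the files are left out.

```python
from urllib.parse import urlparse

# Assumption: slide files are recognised by these path extensions.
SLIDE_EXTENSIONS = (".ppt", ".pptx")


def generate_queries(domains, filetypes=("ppt", "pptx")):
    """Build one search query per (domain, filetype) pair.

    Assumes the search engine supports the `site:` and `filetype:` operators.
    """
    return [f"site:{d} filetype:{ft}" for d in domains for ft in filetypes]


def extract_slide_urls(result_urls, domain):
    """Keep unique URLs hosted on `domain` (or a subdomain) that point to slides."""
    seen = set()
    kept = []
    for url in result_urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        on_domain = host == domain or host.endswith("." + domain)
        is_slide = parsed.path.lower().endswith(SLIDE_EXTENSIONS)
        if on_domain and is_slide and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


if __name__ == "__main__":
    # Hypothetical search results for a hypothetical university domain.
    results = [
        "http://cs.example.edu/course/lec1.ppt",
        "http://cs.example.edu/course/index.html",
        "http://other.example.org/talk.ppt",
        "http://cs.example.edu/course/lec1.ppt",
        "https://example.edu/slides/Intro.PPTX?download=1",
    ]
    print(generate_queries(["example.edu"]))
    print(extract_slide_urls(results, "example.edu"))
```

The extractor filters on host and extension because search results can include off-domain pages and non-slide documents; the resulting list would then be handed to a file downloader.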

Bibliographic details

  • Source
    Journal of Supercomputing | 2020, Issue 10 | pp. 8175-8192 | 18 pages
  • Author affiliations

    Korea Adv Inst Sci & Technol Grad Sch Knowledge Serv Engn Daejeon South Korea;

    Korea Adv Inst Sci & Technol Grad Sch Knowledge Serv Engn Daejeon South Korea;

    Korea Adv Inst Sci & Technol Grad Sch Knowledge Serv Engn Daejeon South Korea;

    Korea Adv Inst Sci & Technol Grad Sch Knowledge Serv Engn Daejeon South Korea;

    Korea Adv Inst Sci & Technol Grad Sch Knowledge Serv Engn Daejeon South Korea;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Web crawler; Focused crawler; Google; Vertical search engine;
