
An Improved Bloom Filter in Distributed Crawler


Abstract

Distributed crawlers deliver considerable value to both business and scientific research by crawling online data resources, but large numbers of duplicate URLs seriously degrade crawler efficiency. A Bloom filter represents a set as a bit array and uses hash functions to test membership, which makes queries efficient while using little space. However, false positives are unavoidable in a Bloom filter. In this paper, URLs are preprocessed with the MD5 algorithm, and an improved multi-dimensional Bloom filter algorithm is proposed that effectively reduces the false-positive rate and improves the efficiency of the distributed crawler.
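
To make the idea of URL deduplication concrete, the following is a minimal Python sketch of a plain Bloom filter that hashes each URL with MD5 before setting bits. It is not the paper's improved multi-dimensional algorithm, which the abstract does not describe in detail. The class name `UrlBloomFilter`, the helper `crawl_once`, the default sizes, and the double-hashing scheme used to derive several bit positions from one MD5 digest are all assumptions made for illustration.

```python
import hashlib


class UrlBloomFilter:
    """Plain Bloom filter over MD5-digested URLs (illustrative sketch only)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        # Sizes are arbitrary assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Preprocess the URL with MD5, then split the 16-byte digest into
        # two 64-bit integers and combine them (double hashing) to obtain
        # num_hashes bit positions.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, url: str) -> bool:
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(url)
        )


def crawl_once(urls, seen: UrlBloomFilter):
    """Return the URLs not yet recorded in `seen`, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A crawler node would call `crawl_once(extracted_links, seen)` on each batch of extracted links and enqueue only the returned URLs. Because a Bloom filter can report a false positive, a small fraction of genuinely new URLs may be skipped; this is the error rate the paper aims to reduce.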
