Web Page Deduplication with an Improved Bloom Filter Algorithm on Hadoop

Huang Weijian (黄伟建); Yang Hailong (杨海龙)


Abstract

Servers that store large volumes of duplicated and near-duplicate data waste storage space. To address this, we propose an improved Bloom filter algorithm that adds a bit array and dynamically optimizes the number of replicas kept for duplicated data, using a weight computed from the number of repeated hits on the bit array. The improved algorithm is then parallelized on a Hadoop distributed cluster to further improve job processing efficiency. Experimental results show that, compared with traditional web page deduplication algorithms, the parallel implementation of the improved Bloom filter not only processes jobs more efficiently but also saves server storage space to a certain extent, because the replica count is adjusted dynamically according to the repeated hits on the bit array.
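The abstract does not give the weight formula or the exact layout of the added array, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the added array stores a repeated-hit counter per slot, that an item's weight is the minimum counter over its slots, and that more frequently hit data keeps more replicas (capped at 3, which is the HDFS default replication factor). The names `HitCountingBloomFilter`, `hit_weight`, `replica_count`, `hits_per_replica`, and `dedupe_partitioned` are all hypothetical. The last function only imitates the map and reduce phases in a single process; it does not use Hadoop.

```python
import hashlib


class HitCountingBloomFilter:
    """Bloom filter plus an auxiliary per-slot hit counter (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability
        self.hits = [0] * num_bits       # assumed added array: repeated hits per slot

    def _positions(self, item):
        # Derive the slot positions by salting SHA-256 with the hash index.
        # A set is returned so that colliding hashes are counted once.
        positions = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, item):
        """Insert item; return True if it was (probably) seen before."""
        positions = self._positions(item)
        seen = all(self.bits[p] for p in positions)
        for p in positions:
            if seen:
                self.hits[p] += 1  # record a repeated hit
            else:
                self.bits[p] = 1   # first sighting: set the bits
        return seen

    def hit_weight(self, item):
        # Assumed weight: the smallest hit counter among the item's slots.
        return min(self.hits[p] for p in self._positions(item))


def replica_count(weight, min_replicas=1, max_replicas=3, hits_per_replica=2):
    """Assumed policy: one extra replica per `hits_per_replica` repeated hits."""
    return min(max_replicas, min_replicas + weight // hits_per_replica)


def dedupe_partitioned(urls, num_partitions=4):
    """Imitate the MapReduce shape in-process.

    Map step: route each URL to a partition by hash, so equal URLs meet.
    Reduce step: deduplicate each partition with its own filter.
    """
    partitions = [[] for _ in range(num_partitions)]
    for url in urls:
        key = int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest()[:4], "big")
        partitions[key % num_partitions].append(url)

    unique = []
    for part in partitions:
        bloom = HitCountingBloomFilter()
        for url in part:
            if not bloom.add(url):
                unique.append(url)
    return unique
```

Because a Bloom filter can report false positives, a sketch like this may occasionally discard a page that was never seen; the bit-array size and hash count control how often that happens.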

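Bibliographic record

Source: Computer Engineering & Science (《计算机工程与科学》), 2017, No. 2, pp. 285-290, 6 pages
Authors: Huang Weijian (黄伟建); Yang Hailong (杨海龙)
Affiliation (both authors): School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei 056038
Original format: PDF
Language: Chinese
CLC classification: Theory, methods
Keywords: Hadoop; Bloom filter; number of replicas; MapReduce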

