An Incremental Website Information Crawling Method Based on Window Comparison

刘学; 麻朴方; 尤佳莉; 脱立恒

Abstract

Bloom filters are an effective way to remove duplicates during incremental crawling of website information, but their false-positive rate rises as the number of stored elements grows. To address this, we design and implement an incremental crawling method based on window comparison. The method crawls a limited length of data at a time, in the order in which the website presents it, and places the data in a data queue in that same order. A comparison window is set at the end of the queue, and the degree of duplication between the data in the window and the already-crawled data determines whether crawling should stop. Experiments show that the method reduces crawling cost when incrementally crawling websites whose information is not strictly sorted by time.
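The stopping rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crawl_increment`, the `fetch_batch` callback, the batch and window sizes, and the duplication threshold are all hypothetical choices made for illustration, and a plain set stands in for the store of already-crawled items.

```python
from typing import Callable, List, Set


def crawl_increment(
    fetch_batch: Callable[[int, int], List[str]],  # hypothetical: (offset, limit) -> item IDs
    seen: Set[str],            # IDs crawled in earlier runs (assumed to fit in memory)
    batch_size: int = 20,      # assumed "limited length" fetched per request
    window_size: int = 5,      # assumed size of the comparison window
    dup_threshold: float = 1.0,  # assumed: stop when this share of the window is already known
    max_batches: int = 100,    # safety bound so the loop always terminates
) -> List[str]:
    """Return the newly discovered item IDs in the site's display order."""
    queue: List[str] = []  # data queue, filled in display order
    for batch_no in range(max_batches):
        batch = fetch_batch(batch_no * batch_size, batch_size)
        if not batch:
            break  # the site has no more items
        queue.extend(batch)
        window = queue[-window_size:]  # comparison window at the end of the queue
        dup_ratio = sum(item in seen for item in window) / len(window)
        if dup_ratio >= dup_threshold:
            break  # the tail of the queue is old data, so stop crawling
    # Keep first occurrences only, then drop everything already crawled.
    new_items = [item for item in dict.fromkeys(queue) if item not in seen]
    seen.update(new_items)
    return new_items


# Toy listing in which new items (n*) are interleaved with old ones (o*),
# i.e. the site is not strictly sorted by update time.
site = ["n3", "o1", "n2", "o2", "n1", "o3", "o4", "o5", "o6", "o7"]
known = {"o1", "o2", "o3", "o4", "o5", "o6", "o7"}
print(crawl_increment(lambda off, lim: site[off:off + lim], known,
                      batch_size=4, window_size=3))
# ['n3', 'n2', 'n1']
```

In this toy run the first batch ends with a window that still contains a new item, so crawling continues. The second batch ends with a window made up entirely of known items, so crawling stops after two requests instead of walking the whole listing.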

Bibliographic details

Source
《网络新媒体技术》, 2017, No. 4, pp. 24-27 (4 pages)
Authors
刘学; 麻朴方; 尤佳莉; 脱立恒
Author affiliations

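National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190;

National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190;

University of Chinese Academy of Sciences, Beijing 100190;

National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190;

National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190;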

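Original format: PDF
Full-text language: Chinese (chi)
Chinese Library Classification
Keywords
incremental crawling; crawling efficiency; Hash; Bloom filter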

