UCrawler: A learning-based web crawler using a URL knowledge base

Wang Wei; Yu Lihua

Abstract

Focused crawlers are fundamental components of vertical search engines: they crawl only the web pages related to a specific topic. Existing focused crawlers commonly suffer from low crawling efficiency and from topic drift (subject migration). In this paper, we propose a learning-based focused crawler that uses a URL knowledge base. To measure topic similarity more accurately, the crawler combines the parent page content, the anchor information, and the URL content. The URL content is also learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism that combines content analysis with a simple link-analysis strategy, which decreases computational complexity and avoids the locality problem of crawling. Experimental results show that the proposed algorithm achieves better precision than traditional methods, including the shark-search and best-first search algorithms, and avoids the local optimum problem of crawling.
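
The abstract does not give the scoring formula, so the following is only a minimal sketch of how the three signals it names (parent page content, anchor information, URL content) might be combined and used to order a crawl frontier. Everything concrete here is an assumption made for illustration and is not taken from the paper: the cosine similarity over token counts, the weights (0.4, 0.3, 0.3), and the names `UrlKnowledgeBase`, `link_score`, and `Frontier`.

```python
import heapq
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens; this also splits URLs on '/', '.', '-'."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class UrlKnowledgeBase:
    """Hypothetical store of URL tokens observed on pages judged relevant."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, url):
        # Called after a fetched page is judged on-topic (iterative update).
        self.counts.update(tokens(url))

    def score(self, url):
        return cosine(Counter(tokens(url)), self.counts)


def link_score(topic, parent_text, anchor_text, url, kb, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three signals; the weights are illustrative only."""
    topic_vec = Counter(tokens(topic))
    w_parent, w_anchor, w_url = weights
    return (w_parent * cosine(topic_vec, Counter(tokens(parent_text)))
            + w_anchor * cosine(topic_vec, Counter(tokens(anchor_text)))
            + w_url * kb.score(url))


class Frontier:
    """Priority queue of unvisited URLs, highest score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


# Usage with made-up pages on a hypothetical example.org site.
kb = UrlKnowledgeBase()
kb.learn("http://example.org/python/crawler/tutorial")

topic = "python web crawler"
parent = "A tutorial on writing a web crawler in Python"
frontier = Frontier()
for anchor, url in [
    ("advanced crawler tips", "http://example.org/python/crawler/advanced"),
    ("chocolate cake", "http://example.org/recipes/cake"),
]:
    frontier.push(url, link_score(topic, parent, anchor, url, kb))

best_url, best_score = frontier.pop()
print(best_url, round(best_score, 3))
```

In this sketch the link whose anchor text and URL tokens overlap with the topic and with the learned URL tokens is popped first. A real crawler would call `kb.learn` only after fetching a page and judging it relevant, which is one way to read the iterative update the abstract describes.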

Bibliographic record

Source
Journal of Computational Methods in Sciences and Engineering, 2021, Issue 2, pp. 461-474 (14 pages)
Authors
Wang Wei; Yu Lihua
Affiliations

Hangzhou Med Coll, Comp Lab, Hangzhou, Zhejiang, Peoples R China;

Netease Hangzhou Network Ltd, Hangzhou, Zhejiang, Peoples R China;

Original format: PDF
Language: English
Keywords
Focused crawler; crawling strategy; URL knowledge base; URL learning

Similar literature

1. Machine Learning-Based Topical Web Crawler: An Ensemble Approach Incorporating Meta-Features [J]. Tae Jun Kim, Han-Joon Kim. Journal of Engineering & Applied Sciences. 2017, Issue 18
2. A Learning-Based Framework for Improving Querying on Web Interfaces of Curated Knowledge Bases [J]. Zhang Wei Emma, Sheng Quan Z., Yao Lina. ACM Transactions on Internet Technology. 2018, Issue 3
3. An approach for selecting seed URLs of focused crawler based on user-interest ontology [J]. YaJun Du, YuFeng Hai, ChunZhi Xie. Applied Soft Computing. 2014, Issue Pta3
4. URL ordering based performance evaluation of Web crawler [C]. Shoaib Mohammed, Maurya Ajay Kumar. 2014 International Conference on Advances in Engineering and Technology Research. 2014
5. Culture, knowledge, and learning: Examining the relationship between learning-based culture, knowledge sharing success, and higher order learning [D]. Minassian, Christopher D. 2007
6. Intelligent Image-Based Railway Inspection System Using Deep Learning-Based Object Detection and Weber Contrast-Based Image Comparison [O]. Jinbeum Jang, Minwoo Shin, Sohee Lim. 2019
7. An Approach for Identifying URLs Based on Division Score and Link Score in Focused Crawler [O]. Debashis Hati, Amritesh Kumar. 2011
