International Conference on Computing, Communication and Networking Technologies
Smart Crawler for Harvesting Deep web with Multi-Classification



Abstract

In the current era, data available on the internet plays a vital role. Research shows that the most valuable data resides in the deep web, so interest in techniques that efficiently locate the invisible web is increasing. The challenges in extracting the deep web are the requirement of a huge volume of resources, the dynamic nature of the deep web, coverage of a wider area of the deep web, and higher efficiency and accuracy of the results collected from it. Alongside these challenges, user demands for privacy and identity must be maintained. In this paper we propose a smart crawler that efficiently searches the deep web and avoids visiting irrelevant pages. The smart crawler starts crawling from the center page of a seed URL and continues until the last available link. The crawler can separate active and inactive links based on requests to the server of each hyperlink. The crawler also contains a text-based site classifier that combines neural networks and natural-language-processing features, namely Term Frequency-Inverse Document Frequency and Bag of Words, with supervised machine-learning techniques such as logistic regression, support vector machines, and Naive Bayes. In addition, HTML tags are extracted from hyperlinks along with the data, which plays a large role in data analysis, and all of this is stored separately in a centralized database. Our experimental results, with an efficient link-reaping rate and classification, show higher accuracy compared to other crawlers.
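The separation of active and inactive links by probing each hyperlink's server, as the abstract describes, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the names `link_is_active` and `partition_links` are hypothetical, and a HEAD request with a 2xx/3xx check is one common way to test liveness.

```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def link_is_active(url, timeout=5):
    """Return True if the hyperlink's server answers with a 2xx/3xx status."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (HTTPError, URLError, ValueError):
        # Unreachable server, HTTP error, or malformed URL: treat as inactive.
        return False

def partition_links(urls, probe=link_is_active):
    """Split hyperlinks into (active, inactive) lists based on server probes."""
    active, inactive = [], []
    for url in urls:
        (active if probe(url) else inactive).append(url)
    return active, inactive
```

The `probe` parameter is injected so the partitioning logic can be exercised without network access.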
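The Bag of Words and TF-IDF features that feed the classifiers can likewise be sketched in plain Python, using the standard TF-IDF definition (term frequency times log inverse document frequency). The helper names `bag_of_words` and `tf_idf` are hypothetical; the paper's actual feature pipeline may differ.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Bag-of-words term counts for one whitespace-tokenized page text."""
    return Counter(doc.lower().split())

def tf_idf(docs):
    """TF-IDF vectors (term -> weight dicts) for a small corpus of page texts."""
    bows = [bag_of_words(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter()
    for bow in bows:
        df.update(bow.keys())
    vectors = []
    for bow in bows:
        total = sum(bow.values())
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in bow.items()
        })
    return vectors
```

Vectors like these would then be fed to a supervised classifier (logistic regression, SVM, or Naive Bayes, as the abstract lists).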
