International Conference on Computing, Communication and Networking Technologies
Smart Crawler for Harvesting Deep web with Multi-Classification



Abstract

In the current era, data available on the internet plays a vital role. Research shows that the most valuable data resides in the deep web, so interest in techniques that efficiently locate the invisible web is increasing. The challenges in extracting the deep web are the requirement of a huge volume of resources, the dynamic nature of the deep web, coverage of a wider area of the deep web, and higher efficiency and accuracy of the results collected from it. Alongside these challenges, user demands for privacy and identity must be maintained. In this paper we propose a smart crawler that efficiently searches the deep web and avoids visiting irrelevant pages. The smart crawler starts crawling from the center page of a seed URL and continues until the last available link. The crawler can separate active and inactive links based on requests to the server of each hyperlink. The crawler also contains a text-based site classifier that combines neural networks and natural-language-processing features, namely Term Frequency-Inverse Document Frequency and Bag of Words, with supervised machine-learning techniques such as logistic regression, support vector machines, and Naive Bayes. In addition, HTML tags are extracted from hyperlinks along with the data, which plays a large role in data analysis, and all of this is stored separately in a centralized database. Our experimental results, with an efficient link-reaping rate and classification, show higher accuracy compared to other crawlers.
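The separation of active and inactive links by probing each hyperlink's server, as the abstract describes, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the names `link_is_active` and `partition_links` are hypothetical, and a HEAD request with a 2xx/3xx check is one common way to test liveness.

```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def link_is_active(url, timeout=5):
    """Return True if the hyperlink's server answers with a 2xx/3xx status."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (HTTPError, URLError, ValueError):
        # Unreachable server, HTTP error, or malformed URL: treat as inactive.
        return False

def partition_links(urls, probe=link_is_active):
    """Split hyperlinks into (active, inactive) lists based on server probes."""
    active, inactive = [], []
    for url in urls:
        (active if probe(url) else inactive).append(url)
    return active, inactive
```

The `probe` parameter is injected so the partitioning logic can be exercised without network access.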
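The Bag of Words and TF-IDF features that feed the classifiers can likewise be sketched in plain Python, using the standard TF-IDF definition (term frequency times log inverse document frequency). The helper names `bag_of_words` and `tf_idf` are hypothetical; the paper's actual feature pipeline may differ.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Bag-of-words term counts for one whitespace-tokenized page text."""
    return Counter(doc.lower().split())

def tf_idf(docs):
    """TF-IDF vectors (term -> weight dicts) for a small corpus of page texts."""
    bows = [bag_of_words(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter()
    for bow in bows:
        df.update(bow.keys())
    vectors = []
    for bow in bows:
        total = sum(bow.values())
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in bow.items()
        })
    return vectors
```

Vectors like these would then be fed to a supervised classifier (logistic regression, SVM, or Naive Bayes, as the abstract lists).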
