Web Data Extraction Approach for Deep Web using WEIDJ

Ily Amalina Ahmad Sabri; Mustafa Man; Wan Aezwani Wan Abu Bakar; Ahmad Nazari Mohd Rose

Abstract

Data extraction is one of the most prominent areas of data mining analysis and has been studied extensively, especially in the field of data requirements and reservoirs. For semi-structured data, the main aim of data extraction is to retrieve useful information from the World Wide Web. Data in the large body of web data known as the deep web is retrievable, but only through a request submitted via a form, because search engines cannot reach it. Data mining applications and automatic data extraction are cumbersome because web pages vary so widely in structure. Most previous data extraction techniques dealt with data types such as text, audio, and video, while research that focuses on images as data is still lacking. The Document Object Model (DOM) is an example of a state-of-the-art data extraction technique related to research on mining image data, and it has been used to solve semi-structured data extraction from the web. However, as HTML documents grow larger, DOM-based extraction has been found to suffer from lengthy processing times and noisy information. In this research work, we propose an improved model, Wrapper Extraction of Image using DOM and JSON (WEIDJ). The model builds on promising results in mining larger volumes of web data across various image formats, and it takes web data extraction from the deep web into account. To assess the efficiency of the proposed model, we compare its data extraction performance at different levels of page extraction with existing methods such as VIBS, MDR, DEPTA and VIDE. WEIDJ yielded the best results, with a Precision of 100, a Recall of 97.93103 and an F-measure of 98.9547.
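
The abstract describes pulling image data out of an HTML document and representing it as JSON, but this record does not include the WEIDJ algorithm itself. The following is only a minimal illustrative sketch of that general idea, not the authors' method. It assumes Python's standard-library event-based `html.parser` in place of a full DOM tree, and the names `ImageCollector`, `extract_images`, and `to_json`, the record fields, and the example URL are all hypothetical.

```python
import json
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageCollector(HTMLParser):
    """Collects one record per <img> element, in document order.

    Hypothetical helper for illustration; this is not the WEIDJ wrapper.
    """

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if not src:
            return  # ignore <img> elements that carry no source
        # Derive the image format from the file extension in the URL path.
        extension = posixpath.splitext(urlparse(src).path)[1]
        self.images.append({
            "src": urljoin(self.base_url, src),  # resolve relative URLs
            "alt": attr.get("alt") or "",
            "format": extension.lstrip(".").lower(),
        })


def extract_images(html, base_url=""):
    """Return a list of image records found in an HTML string."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


def to_json(images):
    """Serialize the extracted image records as a JSON array."""
    return json.dumps(images, indent=2)


if __name__ == "__main__":
    sample = '<div><img src="a/cat.PNG" alt="Cat"><img src="/dog.jpg"/></div>'
    print(to_json(extract_images(sample, "http://example.com/gallery/")))
```

Keeping only the image elements and a few of their attributes is one way to limit the noisy information that the abstract attributes to processing large HTML documents, although how WEIDJ achieves this is not stated here.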

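
The three scores quoted in the abstract are related: the F-measure is conventionally the harmonic mean of precision and recall. The short sketch below checks the reported figures under that standard definition. The function name `f_measure` is an illustrative choice, and the abstract does not state which F-measure variant the authors used, so the balanced F1 form is an assumption.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Both arguments must be on the same scale (here, 0 to 100).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_precision = 100.0
reported_recall = 97.93103

print(round(f_measure(reported_precision, reported_recall), 4))
```

Under this assumption the computed value agrees with the reported F-measure of 98.9547 to four decimal places, so the three figures are mutually consistent.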

Bibliographic record

Source: Procedia Computer Science, 2019, Issue 1, 10 pages
Authors: Ily Amalina Ahmad Sabri; Mustafa Man; Wan Aezwani Wan Abu Bakar; Ahmad Nazari Mohd Rose
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords: Document Object Model; Web Data Extraction; Wrapper Extraction of Image using DOM and JSON (WEIDJ)
Date added to database: 2022-08-18 20:07:49
