Journal: Procedia Computer Science

Web Data Extraction Approach for Deep Web using WEIDJ

Abstract

Data extraction is one of the most prominent areas of data mining analysis and has been studied extensively, especially in the field of data requirements and reservoirs. The main aim of data extraction from semi-structured data is to retrieve useful information from the World Wide Web. Data in large web data stores, also known as the deep web, is retrievable, but only through requests made by form submission, because search engines cannot perform such requests. Data mining applications and automatic data extraction are very cumbersome because of the diverse structure of web pages. Most previous data extraction techniques dealt with various data types such as text, audio and video, but research that focuses on images as data is still lacking. The Document Object Model (DOM) is an example of a state-of-the-art data extraction technique related to research on mining image data. DOM has been used to extract semi-structured data from the web. However, as HTML documents grow larger, the extraction process has been found to suffer from lengthy processing time and noisy information. In this research work, we propose an improved model, Wrapper Extraction of Image using DOM and JSON (WEIDJ), in response to promising results in mining a higher volume of web data from various image formats, taking into consideration web data extraction from the deep web. To observe the efficiency of the proposed model, we compare its data extraction performance at different levels of page extraction with existing methods such as VIBS, MDR, DEPTA and VIDE. It yielded the best results, with a Precision of 100, a Recall of 97.93103 and an F-measure of 98.9547.
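
The abstract does not describe how WEIDJ is implemented, so the following is only a rough sketch of the general idea of pulling image records out of an HTML page and serialising them as JSON. It is not the authors' method. It uses Python's event-based `html.parser` as a stand-in for a true DOM traversal, and the names `ImageCollector`, `extract_images_as_json`, `IMAGE_EXTENSIONS` and the sample page are all hypothetical.

```python
import json
from html.parser import HTMLParser

# Assumed set of image formats; the abstract does not list the formats WEIDJ handles.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp")


class ImageCollector(HTMLParser):
    """Collects <img> elements as the parser walks the page's tags."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        # Treat sources without a known image extension as noise and skip them.
        if src.lower().endswith(IMAGE_EXTENSIONS):
            self.images.append({"src": src, "alt": attributes.get("alt") or ""})


def extract_images_as_json(html):
    """Return the page's image records as a JSON array string."""
    collector = ImageCollector()
    collector.feed(html)
    return json.dumps(collector.images)


# Hypothetical result page, such as one returned after a form submission.
sample_page = """
<html><body>
  <div class="record"><img src="/img/item1.jpg" alt="Item 1"><p>First</p></div>
  <div class="record"><img src="/img/item2.png" alt="Item 2"><p>Second</p></div>
  <img src="/ads/tracker" alt="">
</body></html>
"""

print(extract_images_as_json(sample_page))
```

Filtering on the file extension is a deliberately crude way to drop noisy entries; a real wrapper would presumably rely on the page structure as well.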
