Conference: International Conference on Web Engineering

Automatic Generation of Wrapper for Data Extraction from the Web

Abstract

With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, web pages in HTML must be converted into a format that software programs can interpret. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on an extraction schema: it automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction data schema in the form of a DTD, the wrapper is generated automatically by an induction and learning algorithm. The experiment indicates that the approach extracts the required data from the source document with high accuracy.
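
To make the notion of a wrapper concrete, the following is a minimal sketch in Python of a hand-written wrapper that turns an HTML fragment into XML shaped by a small DTD. It is not the paper's method: the abstract does not detail the induction and learning algorithm, so the extraction rules here are fixed by hand. The sample page, the class names (`item`, `t`, `p`), the `books` schema, and the names `BookWrapper` and `wrap` are all assumptions made for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical target schema, written as a DTD for reference only:
#   <!ELEMENT books (book*)>
#   <!ELEMENT book (title, price)>
#   <!ELEMENT title (#PCDATA)>
#   <!ELEMENT price (#PCDATA)>

# Hand-written extraction rules: HTML class attribute -> XML element name.
# The paper generates its wrapper automatically; this sketch does not.
RULES = {"t": "title", "p": "price"}
RECORD_CLASS = "item"  # class that marks one record in the hypothetical page


class BookWrapper(HTMLParser):
    """Collects records from the sample page into an XML tree."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("books")
        self.record = None  # <book> element currently being filled
        self.field = None   # XML field element currently receiving text

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == RECORD_CLASS:
            self.record = ET.SubElement(self.root, "book")
        elif cls in RULES and self.record is not None:
            self.field = ET.SubElement(self.record, RULES[cls])
            self.field.text = ""

    def handle_data(self, data):
        if self.field is not None:
            self.field.text += data

    def handle_endtag(self, tag):
        # Fields in the sample page are not nested, so any end tag closes the field.
        self.field = None


def wrap(html: str) -> str:
    """Apply the hand-written rules to an HTML string and return XML text."""
    parser = BookWrapper()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


# Hypothetical source page fragment.
SAMPLE = (
    '<ul>'
    '<li class="item"><b class="t">Web Data</b> <i class="p">30</i></li>'
    '<li class="item"><b class="t">XML Basics</b> <i class="p">25</i></li>'
    '</ul>'
)

if __name__ == "__main__":
    print(wrap(SAMPLE))
```

Running the script prints a `books` document with one `book` element per list item, each holding a `title` and a `price`. A generated wrapper, as described in the abstract, would derive rules like `RULES` from the user's DTD and sample pages instead of having them written by hand.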
