IEEE Transactions on Knowledge and Data Engineering

Structured Data Extraction from the Web Based on Partial Tree Alignment

Abstract

This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide languages that facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (which are data extraction programs) from human-labeled examples. Manual labeling is time-consuming and hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, they usually need multiple pages that conform to a common schema as input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
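To make the tree-matching idea in step 1 concrete, here is a minimal sketch. The abstract does not say which tree-matching algorithm is used, so this example assumes Simple Tree Matching, a standard dynamic-programming method for ordered labeled trees; it is an illustration, not the paper's implementation. The `Node` class, the function names, and the normalization in `similarity` are hypothetical choices made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal tag-tree node (hypothetical type for this sketch)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Count node pairs in a maximum order-preserving matching of two trees.

    Assumption: this is Simple Tree Matching, chosen for illustration;
    the abstract only says "tree matching".
    """
    if a.tag != b.tag:
        return 0  # different root labels: the trees do not match at all
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b, keeping sibling order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching score normalized by the average tree size (assumed choice)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two hypothetical record subtrees: the second lacks the last field.
record1 = Node("tr", [Node("td", [Node("img")]),
                      Node("td", [Node("a")]),
                      Node("td", [Node("span")])])
record2 = Node("tr", [Node("td", [Node("img")]),
                      Node("td", [Node("a")])])

print(simple_tree_matching(record1, record2))
print(round(similarity(record1, record2), 3))
```

A high normalized score suggests that two adjacent subtrees are instances of the same record template, which is the kind of signal a record-segmentation step can use. Unmatched nodes, such as the extra `td` in `record1`, are the items on which a partial alignment would defer a decision.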