
Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment


Abstract

This paper is concerned with the problem of structured data extraction from Web pages. The objective of the research is to automatically segment data records in a page, extract data items/fields from these records and store the extracted data in a database. In this paper, we first introduce the extraction problem, and then discuss the main existing approaches and their limitations. After that, we introduce a novel technique (called DEPTA) to automatically perform Web data extraction. The method consists of three steps: (1) identifying data records with similar patterns in a page, (2) aligning and extracting data items from the identified data records and (3) generating tree-based regular expressions to facilitate later extraction from other similar pages. The key innovation is the proposal of a new multiple tree alignment algorithm called partial tree alignment, which was found to be particularly suitable for Web data extraction. This paper is based on our work published in KDD-03 and WWW-05.
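
The abstract does not spell out how record similarity in step (1) is measured. As a rough illustration only, the sketch below uses Simple Tree Matching, a standard dynamic-programming measure for ordered labeled trees, to score how alike two tag subtrees are. The `Node` class, the function names, and the normalization by average tree size are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal tag-tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum order-preserving matching of a and b."""
    # Trees whose roots carry different tags do not match at all.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    # Add 1 for the matched pair of roots.
    return dp[m][n] + 1


def tree_size(t: Node) -> int:
    """Total number of nodes in the tree rooted at t."""
    return 1 + sum(tree_size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching score normalized by average tree size (an assumed choice)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# Two toy "data records": table rows that share most of their structure.
rec1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")]), Node("td")])
rec2 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")])])
print(simple_tree_matching(rec1, rec2), round(similarity(rec1, rec2), 3))
```

In a pipeline of this kind, subtrees whose normalized score exceeds some chosen threshold could be grouped as candidate data records before alignment. The threshold and the grouping strategy are left open here because the abstract does not state them.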
