Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment


Abstract

This paper is concerned with the problem of structured data extraction from Web pages. The objective of the research is to automatically segment data records in a page, extract data items/fields from these records and store the extracted data in a database. In this paper, we first introduce the extraction problem, and then discuss the main existing approaches and their limitations. After that, we introduce a novel technique (called DEPTA) to automatically perform Web data extraction. The method consists of three steps: (1) identifying data records with similar patterns in a page, (2) aligning and extracting data items from the identified data records and (3) generating tree-based regular expressions to facilitate later extraction from other similar pages. The key innovation is the proposal of a new multiple tree alignment algorithm called partial tree alignment, which was found to be particularly suitable for Web data extraction. This paper is based on our work published in KDD-03 and WWW-05.
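The abstract names tree matching as the basis for comparing data records but does not spell out an algorithm. As a hedged illustration only, the sketch below assumes the classic Simple Tree Matching recurrence (a dynamic program over ordered child lists) and a hypothetical `Node` class standing in for an HTML tag tree. The names `Node`, `simple_tree_matching`, `tree_size`, and `normalized_similarity` are introduced here for illustration and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ordered, labeled tree node (a stand-in for an HTML tag node)."""
    tag: str
    children: list[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Count node pairs in a maximum top-down, order-preserving matching of a and b."""
    if a.tag != b.tag:
        return 0  # roots with different labels cannot be matched
    m, n = len(a.children), len(b.children)
    # best[i][j]: best score using the first i subtrees of a and the first j of b
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i][j - 1], best[i - 1][j], best[i - 1][j - 1] + pair)
    return best[m][n] + 1  # +1 for the matched roots


def tree_size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(tree_size(c) for c in t.children)


def normalized_similarity(a: Node, b: Node) -> float:
    """Matching score divided by the average tree size (an assumed normalization)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# Two toy "data records": the second one lacks the trailing <td><span> cell.
record_a = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")]), Node("td", [Node("span")])])
record_b = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")])])
score = simple_tree_matching(record_a, record_b)
similarity = normalized_similarity(record_a, record_b)
```

In this toy case the two records share the `tr` root and the first two `td` cells with their children. A record-segmentation step could treat sibling subtrees whose normalized similarity exceeds a chosen threshold as belonging to the same data region. The threshold and the normalization are assumptions of this sketch, not values reported in the abstract.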

Bibliographic details

Source: National Conference on Artificial Intelligence (AAAI-06); Innovative Applications of Artificial Intelligence Conference (IAAI-06), 2006, pp. 1687-1690 (4 pages)
Authors: Yanhong Zhai; Bing Liu
Original format: PDF
CLC classification: Artificial intelligence theory

