Structured Data Extraction from the Web Based on Partial Tree Alignment

Yanhong Zhai; Bing Liu


Abstract

This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide some languages to facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (which are data extraction programs) from human-labeled examples. Manual labeling is time-consuming and hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, multiple pages that conform to a common schema are usually needed as the input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
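The abstract mentions tree matching for step 1 without giving details. The following is a minimal sketch of one common choice, a Simple Tree Matching style dynamic program over ordered tag trees. It is an assumption for illustration, not necessarily the exact procedure used in DEPTA. The `Node` class, the function names, and the normalization by average tree size are all hypothetical choices made here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of a maximum order-preserving matching between two trees.

    Roots must carry the same tag; otherwise nothing matches.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 counts the matched roots


def size(t: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching score normalized by the average tree size (an assumption)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


# Two hypothetical records: the second lacks the third cell.
r1 = Node("tr", [Node("td", [Node("img")]),
                 Node("td", [Node("a")]),
                 Node("td", [Node("span")])])
r2 = Node("tr", [Node("td", [Node("img")]),
                 Node("td", [Node("a")])])
print(simple_tree_matching(r1, r2), round(similarity(r1, r2), 3))
```

Under this sketch, two subtrees whose normalized score exceeds some chosen threshold would be treated as candidate records of the same kind. The threshold value itself is not given in the abstract.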


Bibliographic details

Source: IEEE Transactions on Knowledge and Data Engineering, 2006, pp. 1614-1628 (15 pages)
Authors: Yanhong Zhai; Bing Liu
Original format: PDF
Language: English
CLC classification: Computing technology, computer technology
Keywords: Internet; database management systems; information retrieval; learning (artificial intelligence); storage management; tree data structures; Web mining; Web pages; automatic pattern discovery; data records; database; machine learning techniques; partial tree alignment


Similar literature

1. T-BAS: Tree-Based Alignment Selector toolkit for phylogenetic-based placement, alignment downloads and metadata visualization: an example with the Pezizomycotina tree of life [J]. Bioinformatics, 2017, No. 8.
2. Multi Level Web Data Extraction Based Topical Visual Structure Clustering for Efficient Web Search [J]. Sureshkumar T, Shanthi N. Journal of Computational and Theoretical Nanoscience, 2017, No. 9.
3. STEM: a suffix tree-based method for web data records extraction [J]. Fang Yixiang, Xie Xiaoqin, Zhang Xiaofeng. Knowledge and Information Systems, 2018, No. 2.
4. Web Data Extraction Based on Visual Information and Partial Tree Alignment [C]. Siwu Fan, Xinjun Wang, Yongquan Dong. Web Information System and Application Conference, 2014.
5. Structured data extraction from the Web [D]. Zhai, Yanhong. 2006.
6. T-BAS Version 2.1: Tree-Based Alignment Selector Toolkit for Evolutionary Placement of DNA Sequences and Viewing Alignments and Specimen Metadata on Curated and Custom Trees [O]. Ignazio Carbone, James B. White, Jolanta Miadlikowska. 2019.
7. Web data extraction based on partial tree alignment [O]. Yanhong Zhai. 2005.
