An Automatic Web Data Extraction Approach based on Path Index Trees

Yan Wen; Qingtian Zeng; Hua Duan; Feng Zhang; Xin Chen


Abstract

This paper proposes a novel approach called ITE to extract web data records fully automatically. The approach uses the tag index information in different layers of the HTML DOM tree and introduces the concept of an index tree, together with its repetitiveness and consecutiveness, to characterize the key structural information in a web page. Repetitiveness captures the structural similarities among data records, and consecutiveness captures the sequential arrangement of multiple records. Based on these concepts, the complex DOM tree can be compressed into a set of index trees. We also provide a series of properties as theoretical support. The extraction process is divided into three steps: repetitiveness discovery, consecutiveness discovery, and index tree merging. To handle missing data fields, multiple record roots, and other complicated situations, we propose a digital sequence similarity measurement and a hierarchical clustering approach to find the repeating patterns. Data records are then identified by the consecutiveness discovery method, and the data blocks containing full data records are restored by merging the index trees. Experiments demonstrate the effectiveness and efficiency of the proposed approach. It outperforms existing classic approaches in accuracy and has a satisfactory execution time, which makes it applicable to large datasets. The time complexity is linear in the number of leaf nodes in the DOM tree of a web page.
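To make the idea of tag index information more concrete, the following is a minimal Python sketch, not the authors' ITE implementation. It assumes that a "tag path with indexes" can be modeled as the sequence of (tag, sibling position) pairs from the root to each text leaf. The abstract does not define the index tree, the similarity measure, or the clustering step, so none of those are reproduced here. All names (`IndexedPathCollector`, `group_by_tag_path`, `varying_layer`, `is_consecutive`, `PAGE`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags without a closing tag; they count as children but open no new level.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class IndexedPathCollector(HTMLParser):
    """Records, for every non-empty text node, its (tag, sibling index) path."""

    def __init__(self):
        super().__init__()
        self.stack = []          # current path as (tag, index) pairs
        self.child_counts = [0]  # element children seen so far at each open level
        self.leaves = []         # (indexed path, text) pairs

    def handle_starttag(self, tag, attrs):
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, index))
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        # Assumes well-formed nesting; unmatched end tags are ignored.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((tuple(self.stack), text))


def indexed_leaf_paths(html):
    collector = IndexedPathCollector()
    collector.feed(html)
    collector.close()
    return collector.leaves


def group_by_tag_path(leaves):
    """Group leaves that share a tag sequence, keeping only their index sequences."""
    groups = defaultdict(list)
    for path, _text in leaves:
        tags = "/".join(tag for tag, _ in path)
        groups[tags].append(tuple(index for _, index in path))
    return dict(groups)


def varying_layer(index_sequences):
    """Return the single depth at which the sequences differ, or None."""
    depths = [d for d in range(len(index_sequences[0]))
              if len({seq[d] for seq in index_sequences}) > 1]
    return depths[0] if len(depths) == 1 else None


def is_consecutive(values):
    """True if the distinct values form an unbroken run such as 0, 1, 2."""
    ordered = sorted(set(values))
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


PAGE = """
<html><body><ul>
<li><b>Book A</b><i>10</i></li>
<li><b>Book B</b><i>12</i></li>
<li><b>Book C</b><i>9</i></li>
</ul></body></html>
"""

groups = group_by_tag_path(indexed_leaf_paths(PAGE))
for tags, sequences in groups.items():
    depth = varying_layer(sequences)
    print(tags, sequences, "varying layer:", depth)
```

In this toy page, the three `<b>` leaves share one tag path and their index sequences differ only at the `li` layer, where the indexes run 0, 1, 2. That repeated structure with an unbroken index run is a simplified stand-in for what the abstract calls repetitiveness and consecutiveness. A single pass over the leaves is enough to collect the paths, which is in the spirit of the linear time bound stated above, although this sketch makes no claim about the paper's actual algorithm.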

Bibliographic details

Source
International Journal of Performability Engineering | 2018, Issue 10 | 12 pages
Authors
Yan Wen; Qingtian Zeng; Hua Duan; Feng Zhang; Xin Chen
Author affiliations

College of Computer Science and Engineering, Shandong University of Science and Technology

College of Electronic Communication and Physics, Shandong University of Science and Technology

College of Mathematics and System Science, Shandong University of Science and Technology

College of Computer Science and Engineering, Shandong University of Science and Technology

College of Computer Science and Engineering, Shandong University of Science and Technology

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Engineering design and surveying
Keywords
Tag path; Index tree; Automatic data extraction; Web data

