LTDE: A Layout Tree Based Approach for Deep Page Data Extraction

Jun ZENG; Feng LI; Brendan FLANAGAN; Sachio HIROKAWA

Abstract

Content extraction from deep Web pages has received considerable attention in recent years. However, the increasingly complicated HTML structure of Web documents makes it difficult to recognize data records by analyzing the HTML source code alone. In this paper, we propose a method named LTDE to extract data records from a deep Web page. Instead of analyzing the HTML source code, LTDE utilizes the visual features of data records in deep Web pages. A Web page is treated as a finite set of visual blocks, and the data records are the visual blocks that have similar layouts. We also propose a pattern recognition method named layout tree to cluster visual blocks with similar layouts. A weight is calculated for each cluster, and the visual blocks in the cluster with the highest weight are chosen as the data records to be extracted. The experimental results show that LTDE is more effective and more robust for Web data extraction than previous works.
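The pipeline outlined in the abstract (visual blocks, clustering by similar layout, cluster weighting, selection of the heaviest cluster) can be illustrated with a minimal Python sketch. The abstract does not define the layout tree or the weight function, so everything concrete here is an assumption made for illustration: the `Block` type, the `layout_signature` key (exact match on block size and relative child geometry), and the area-based `cluster_weight` are hypothetical stand-ins, not the authors' definitions.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Block:
    """A rendered visual block: bounding box plus nested child blocks (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int
    children: list[Block] = field(default_factory=list)


def layout_signature(block: Block) -> tuple:
    # Assumed similarity key: the block's size plus, recursively, each child's
    # offset relative to the block and the child's own signature.
    ordered = sorted(block.children, key=lambda c: (c.y, c.x))
    return (
        block.w,
        block.h,
        tuple((c.x - block.x, c.y - block.y) + layout_signature(c) for c in ordered),
    )


def cluster_by_layout(blocks: list[Block]) -> list[list[Block]]:
    # Blocks with identical signatures fall into the same cluster.
    clusters = defaultdict(list)
    for b in blocks:
        clusters[layout_signature(b)].append(b)
    return list(clusters.values())


def cluster_weight(cluster: list[Block]) -> int:
    # Assumed weight: total area covered by the cluster. Singleton clusters get
    # weight 0 because a record list is expected to repeat a layout.
    if len(cluster) < 2:
        return 0
    return sum(b.w * b.h for b in cluster)


def extract_records(blocks: list[Block]) -> list[Block]:
    # Choose the visual blocks of the highest-weight cluster as the data records.
    return max(cluster_by_layout(blocks), key=cluster_weight, default=[])


def make_record(y: int) -> Block:
    # Toy record: a thumbnail on the left and a title line to its right.
    return Block(0, y, 600, 100, [Block(0, y, 100, 100), Block(120, y + 10, 400, 30)])


# Toy page: header, three result records, footer.
page = [Block(0, 0, 600, 60), make_record(80), make_record(200),
        make_record(320), Block(0, 440, 600, 40)]
records = extract_records(page)
print([b.y for b in records])
```

In this toy page the three result blocks share one signature, while the header and footer each form a singleton cluster, so the repeated layout is selected. A real system would obtain the blocks from a rendering engine and would need a tolerance-based similarity measure rather than exact signature equality.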

Bibliographic information

Source: IEICE Transactions on Information and Systems, 2017, No. 5, 12 pages
Original format: PDF
Chinese Library Classification: Radio electronics; telecommunications technology
