Journal: IEICE Transactions on Information and Systems

LTDE: A Layout Tree Based Approach for Deep Page Data Extraction



Abstract

Content extraction from deep Web pages has received great attention in recent years. However, the increasingly complicated HTML structure of Web documents makes it difficult to recognize data records by analyzing the HTML source code alone. In this paper, we propose a method named LTDE to extract data records from a deep Web page. Instead of analyzing the HTML source code, LTDE utilizes the visual features of data records in deep Web pages. A Web page is treated as a finite set of visual blocks, and the data records are the visual blocks that have similar layouts. We also propose a pattern recognition method, named layout tree, to cluster visual blocks with similar layouts. A weight is calculated for each cluster, and the visual blocks in the cluster with the highest weight are chosen as the data records to be extracted. Experimental results show that LTDE is more effective and more robust for Web data extraction than previous work.
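The pipeline the abstract describes (visual blocks, clustering by similar layout, picking the highest-weight cluster) can be illustrated with a rough Python sketch. The abstract does not define the layout tree or the weight function, so everything below is an assumption made for illustration: the `Block` type, the `layout_signature` helper (a nested tuple of child arrangements standing in for a layout tree), and the weight (total rendered area of a cluster with at least two members) are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Block:
    """A rendered visual block: bounding box plus nested child blocks (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int
    children: list[Block] = field(default_factory=list)


def layout_signature(block: Block) -> tuple:
    """Assumed stand-in for a layout tree: the nested arrangement of children
    in reading order (top to bottom, then left to right), ignoring absolute
    position and size."""
    ordered = sorted(block.children, key=lambda c: (c.y, c.x))
    return tuple(layout_signature(c) for c in ordered)


def cluster_weight(members: list[Block]) -> int:
    """Assumed weight: total area covered by the cluster.
    A single block cannot be a repeated record, so it scores zero."""
    if len(members) < 2:
        return 0
    return sum(m.w * m.h for m in members)


def extract_records(blocks: list[Block]) -> list[Block]:
    """Group blocks with identical layout signatures and return the
    members of the highest-weight cluster, or [] if none repeats."""
    clusters: dict[tuple, list[Block]] = defaultdict(list)
    for b in blocks:
        clusters[layout_signature(b)].append(b)
    best = max(clusters.values(), key=cluster_weight, default=[])
    return best if cluster_weight(best) > 0 else []
```

In this sketch, a page with one navigation bar, three result entries that each hold an image next to a text area, and one footer would yield the three result entries, since they share a signature and jointly cover the largest area. A real implementation would need a tolerance-based similarity measure rather than exact signature equality.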
