Journal: International Journal of Performability Engineering

An Automatic Web Data Extraction Approach based on Path Index Trees



Abstract

This paper proposes ITE, a novel approach for extracting web data records fully automatically. The approach uses the tag index information at different layers of the HTML DOM tree and abstracts it into the concept of an index tree, together with its repetitiveness and consecutiveness, which characterize the key structural information in a web page. Repetitiveness captures the structural similarity among data records, and consecutiveness captures the sequential arrangement of multiple records. Based on these concepts, the complex DOM tree can be compressed into a set of index trees. We also provide a series of properties as theoretical support. The extraction process has three steps: repetitiveness discovery, consecutiveness discovery, and index tree merging. To handle missing data fields, multiple record roots, and other complicated situations, we propose a digital sequence similarity measure and a hierarchical clustering approach to find the repeating patterns. Data records are then identified by the consecutiveness discovery method, and the data blocks containing full data records are restored by merging the index trees. Experiments demonstrate the effectiveness and efficiency of the proposed approach: it outperforms existing classic work in accuracy and has a satisfactory execution time, which makes it applicable to large datasets. The time complexity is linear in the number of leaf nodes in the DOM tree of a web page.
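The abstract does not give ITE's definitions or algorithms, so the following is only a rough illustration of the underlying intuition, not the authors' method. It records, for each text leaf in a small HTML fragment, its tag path and the sibling index at every layer, then groups leaves that share a tag path (a crude stand-in for repetitiveness) and checks whether the indices at the first differing layer are consecutive (a crude stand-in for consecutiveness). The names `PathCollector`, `repeated_paths`, `record_level`, and `SAMPLE` are hypothetical, and the sketch assumes well-formed HTML.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; a partial list, assumed for this sketch.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Records, for every text leaf, its tag path and its sibling-index path."""

    def __init__(self):
        super().__init__()
        self.stack = []          # (tag, index among siblings) for open elements
        self.child_counts = [0]  # children seen so far at each open depth
        self.leaves = []         # (tag_path, index_path, text)

    def handle_starttag(self, tag, attrs):
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        if tag in VOID_TAGS:
            return  # counts as a sibling but is never pushed on the stack
        self.stack.append((tag, index))
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        # Assumes well-formed input: only pop when the tag matches the top.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag_path = "/".join(t for t, _ in self.stack)
            index_path = tuple(i for _, i in self.stack)
            self.leaves.append((tag_path, index_path, text))


def repeated_paths(html, min_count=2):
    """Group text leaves by tag path and keep paths seen at least min_count times."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    groups = defaultdict(list)
    for tag_path, index_path, text in collector.leaves:
        groups[tag_path].append((index_path, text))
    return {p: v for p, v in groups.items() if len(v) >= min_count}


def record_level(index_paths):
    """Return (depth, indices) for the first layer where the index paths differ."""
    for depth in range(min(len(p) for p in index_paths)):
        values = [p[depth] for p in index_paths]
        if len(set(values)) > 1:
            return depth, values
    return None


def is_consecutive(values):
    """True when the indices form an unbroken ascending run."""
    return values == list(range(values[0], values[0] + len(values)))


SAMPLE = """
<ul>
  <li><b>Book A</b><span>$10</span></li>
  <li><b>Book B</b><span>$12</span></li>
  <li><b>Book C</b><span>$9</span></li>
</ul>
"""

if __name__ == "__main__":
    for path, leaves in repeated_paths(SAMPLE).items():
        depth, indices = record_level([p for p, _ in leaves])
        print(path, "varies at depth", depth, "consecutive:", is_consecutive(indices))
```

In this toy input, the leaves under `ul/li/b` and `ul/li/span` differ only in the index of their `li` ancestor, which is what marks `li` as the candidate record root. A real page would also need the handling of missing fields and multiple record roots that the abstract attributes to similarity measurement and hierarchical clustering, which this sketch does not attempt.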
