Conference: The 2nd International Conference on Software Engineering and Data Mining

Web wrapper generation using tree alignment and transfer learning

Abstract

This paper studies web wrapper generation for the pages of forum, blog, and news web sites. More and more web pages are dynamically generated from a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate a wrapper from such pages. We present a new tree alignment algorithm that finds the best matching structure among the input web pages. A form of linear regression is employed to obtain the weights of the different tag matchings. Based on the alignment, we merge the trees into one union tree whose nodes record statistical information gathered from multiple web pages. We use a transfer learning method to find the most likely content block and use the alignment algorithm to detect repeated patterns on the union tree. Finally, we generate a wrapper to extract data from web pages. Experimental results show that the method achieves high extraction accuracy and stable performance.
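
The abstract does not spell out the alignment algorithm, so the following is only an illustrative sketch of the general idea of weighted tree alignment, not the authors' method. It uses the classical Simple Tree Matching recurrence as a stand-in. The `Node` class, the `default_weight` function, and the two toy page trees are assumptions made for this example; in the paper the tag-matching weights are said to come from linear regression, whereas here they are fixed at 1 for identical tags and 0 otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A minimal DOM-like tree node (hypothetical, for illustration only)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def default_weight(a: str, b: str) -> float:
    # Assumed weight: 1 for identical tags, 0 otherwise.
    # The paper learns these weights; this sketch does not.
    return 1.0 if a == b else 0.0


def tree_match(a: Node, b: Node,
               weight: Callable[[str, str], float] = default_weight) -> float:
    """Score the best order-preserving alignment of two trees
    (Simple Tree Matching with a pluggable tag weight)."""
    w = weight(a.tag, b.tag)
    if w == 0.0:
        return 0.0  # roots do not match, so nothing below them is aligned
    m, n = len(a.children), len(b.children)
    # table[i][j]: best score aligning the first i children of a
    # with the first j children of b.
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = tree_match(a.children[i - 1], b.children[j - 1], weight)
            table[i][j] = max(table[i - 1][j],
                              table[i][j - 1],
                              table[i - 1][j - 1] + pair)
    return w + table[m][n]


# Two toy pages that share a template but differ in one child.
page1 = Node("div", [Node("h1"), Node("p"), Node("p")])
page2 = Node("div", [Node("h1"), Node("p"), Node("span")])
print(tree_match(page1, page2))  # 3.0: div, h1 and one p are aligned
```

A higher score means the two pages share more template structure. A union tree of the kind the abstract describes could then be built by merging the aligned nodes and counting how often each one occurs, though that step is not shown here.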