Self-supervised Automated Wrapper Generation for Weblog Data Extraction

Abstract

Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
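
The matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `PathTextCollector`, the functions `induce_rule` and `apply_rule`, the tag-plus-class path notation, and the vote-share score are all hypothetical choices made for illustration. It assumes each post is available as a pair of a feed value (for example the title taken from the RSS or Atom entry) and the post's HTML, and it stands in for the paper's probabilistic rule derivation with a simple frequency count across posts.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collects a (path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, e.g. ["html", "body", "h1.title"]
        self.found = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.found.append(("/".join(self.stack), text))


def collect(html):
    """Return all (path, text) pairs found in an HTML string."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def induce_rule(posts):
    """posts: list of (feed_value, html) pairs.

    Returns (path, score), where score is the share of posts in which
    the feed value was found at that path, or None if nothing matched.
    """
    votes = Counter()
    for feed_value, html in posts:
        target = " ".join(feed_value.split()).lower()
        # A set ensures each post votes at most once per candidate path.
        paths = {p for p, text in collect(html) if text.lower() == target}
        votes.update(paths)
    if not votes:
        return None
    path, count = votes.most_common(1)[0]
    return path, count / len(posts)


def apply_rule(path, html):
    """Extract the text of every element located at the induced path."""
    return [text for p, text in collect(html) if p == path]
```

Calling `induce_rule` on a handful of posts from one weblog yields the path that most consistently holds the feed value, and `apply_rule` can then extract the same property from posts that no longer appear in the feed. Matching against several posts at once, rather than trusting a single match, is what lets an element that coincides with the feed value only by chance be outvoted.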

Bibliographic details

  • Source
    Big data | 2013 | pp. 292-302 | 11 pages
  • Conference location: Oxford (GB)
  • Author affiliations

    Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom;

    Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom;

    Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom;

    Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom;

  • Conference organizer
  • Original format: PDF
  • Language: eng
  • Chinese Library Classification
  • Keywords

    Web Information Extraction; Automatic Wrapper Induction; Weblogs;

  • Indexed: 2022-08-26 14:28:20
