Self-supervised Automated Wrapper Generation for Weblog Data Extraction

Abstract

Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
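
The matching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not specify the rule format or the probabilistic scoring, so this sketch assumes a rule is simply a (tag, class) pair and scores each candidate by the fraction of posts in which it locates the feed value. The names `TextCollector`, `elements`, `induce_rule`, `extract`, and the sample `posts` are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not capture text.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextCollector(HTMLParser):
    """Collects (rule, text) pairs; a rule is an assumed (tag, class) pair."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, class) tuples
        self.found = []   # (rule, stripped text) for every text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.found.append((self.stack[-1], text))


def elements(html):
    """Return all (rule, text) pairs found in an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def induce_rule(posts, field):
    """Pick the rule that most often locates the feed value of `field`.

    posts: list of (feed_entry dict, post HTML string) pairs.
    Returns (rule, share of posts in which the rule matched).
    """
    votes = Counter()
    for entry, html in posts:
        value = entry.get(field)
        # Count each rule at most once per post.
        votes.update({rule for rule, text in elements(html) if text == value})
    if not votes:
        return None, 0.0
    rule, count = votes.most_common(1)[0]
    return rule, count / len(posts)


def extract(html, rule):
    """Apply an induced rule to a post and return the first matching text."""
    for candidate, text in elements(html):
        if candidate == rule:
            return text
    return None


# Hypothetical sample: feed entries paired with the HTML of the same posts.
posts = [
    ({"title": "First post", "author": "Ann"},
     '<html><head><title>First post</title></head><body>'
     '<h1 class="entry-title">First post</h1>'
     '<span class="byline">Ann</span><p>Hello</p></body></html>'),
    ({"title": "Second post", "author": "Bob"},
     '<html><head><title>Blog | Second post</title></head><body>'
     '<h1 class="entry-title">Second post</h1>'
     '<span class="byline">Bob</span><p>World</p></body></html>'),
]

rule, confidence = induce_rule(posts, "title")
print(rule, confidence)
```

In this sample the `<title>` element matches the feed title in only one of the two posts, while the `entry-title` heading matches in both, so aggregating over multiple posts selects the heading. Once induced, the rule can be applied with `extract` to posts that no longer appear in the feed.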

Bibliographic record

Source: Big data, 2013, pp. 292-302 (11 pages)
Conference location: Oxford (GB)
Authors: George Gkotsis; Karen Stepanyan; Alexandra I. Cristea; Mike Joy
Author affiliation (all four authors): Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom
Original format: PDF
Language: English
Keywords: Web Information Extraction; Automatic Wrapper Induction; Weblogs
Date indexed: 2022-08-26 14:28:20

Similar literature

1. Entropy-based automated wrapper generation for weblog data extraction [J]. George Gkotsis, Karen Stepanyan, Alexandra I. Cristea. World Wide Web, 2014, No. 4
2. Extraction Japanese Slang from Weblog Data based on Script Type and Stroke Count [J]. Kazuyuki Matsumoto, Kyosuke Akita, Xielifuguli Keranmu. Procedia Computer Science, 2014, No. 1
3. L-wrappers: concepts, properties and construction - A declarative approach to data extraction from web sources [J]. Badica C, Badica A, Popescu E. Soft computing: A fusion of foundations, methodologies and applications, 2007, No. 8
4. Self-supervised Automated Wrapper Generation for Weblog Data Extraction [C]. George Gkotsis, Karen Stepanyan, Alexandra I. Cristea. British national conference on databases, 2013
5. Automatic wrapper generation for the extraction of search result records from search engines [D]. Zhao, Hongkun. 2007
6. Strategies for Medical Data Extraction and Presentation Part 3: Automated Context- and User-Specific Data Extraction [O]. Bruce Reiner. 2015
7. Self-supervised automated wrapper generation for weblog data extraction [O]. Gkotsis, George; Stepanyan, Karen; Cristea, Alexandra I. 2013