Journal: IEEE Transactions on Knowledge and Data Engineering

Learning to Adapt Web Information Extraction Knowledge and Discovering New Attributes via a Bayesian Approach


Abstract

This paper presents a Bayesian learning framework for adapting information extraction wrappers with new attribute discovery, reducing the human effort needed to extract precise information from unseen Web sites. Our approach aims to automatically adapt the information extraction knowledge previously learned from a source Web site to a new, unseen site while simultaneously discovering previously unseen attributes. Two kinds of text-related clues from the source Web site are considered. The first is obtained from the extraction pattern contained in the previously learned wrapper. The second is derived from the previously extracted or collected items. To harness the uncertainty involved, we design a generative model of the text fragments related to attribute values in a Web page, covering both their site-independent content information and their site-dependent layout format. Under this generative model, Bayesian learning and expectation-maximization (EM) techniques are developed to identify new training data for learning the new wrapper for a new unseen site. Previously unseen attributes, together with their semantic labels, can also be discovered via a second EM-based Bayesian learning procedure built on the same generative model. We have conducted extensive experiments on more than 30 real-world Web sites in three different domains to demonstrate the effectiveness of our framework.
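The abstract does not spell out the generative model, so the following is only a rough illustration of the general idea of using previously extracted items as content clues and EM to pick out candidate training fragments on a new site. It is a generic two-class multinomial mixture (a naive-Bayes-style model), not the paper's actual model, and it ignores the site-dependent layout information entirely. The function name `em_label_fragments`, its parameters, and the toy data are all assumptions made for this sketch.

```python
import math
from collections import Counter


def em_label_fragments(seed_items, fragments, n_iter=20, alpha=1.0):
    """Return, for each fragment, a soft probability that it is an attribute value.

    Hypothetical helper: a two-class multinomial mixture fitted by EM.
    Class 1 = "attribute value", class 0 = "background text".
    seed_items play the role of items previously extracted from a source site.
    """
    docs = [f.lower().split() for f in fragments]
    seeds = [s.lower().split() for s in seed_items]
    vocab = sorted({t for d in docs + seeds for t in d})

    # Initialisation: seed tokens define the attribute class;
    # the background class starts from all tokens seen on the new site.
    seed_counts = Counter(t for s in seeds for t in s)
    bg_counts = Counter(t for d in docs for t in d)

    def normalise(counts):
        # Token distribution with add-alpha smoothing over the vocabulary.
        total = sum(counts.get(t, 0.0) for t in vocab) + alpha * len(vocab)
        return {t: (counts.get(t, 0.0) + alpha) / total for t in vocab}

    theta = [normalise(bg_counts), normalise(seed_counts)]
    prior = [0.5, 0.5]
    resp = [0.5] * len(docs)

    for _ in range(n_iter):
        # E-step: posterior probability of the attribute class per fragment.
        for i, d in enumerate(docs):
            log_p = [
                math.log(prior[k]) + sum(math.log(theta[k][t]) for t in d)
                for k in (0, 1)
            ]
            m = max(log_p)
            w = [math.exp(lp - m) for lp in log_p]
            resp[i] = w[1] / (w[0] + w[1])

        # M-step: re-estimate token distributions from soft counts.
        # Seed counts stay in as fixed evidence for the attribute class.
        counts = [Counter(), Counter(seed_counts)]
        for r, d in zip(resp, docs):
            for t in d:
                counts[0][t] += 1.0 - r
                counts[1][t] += r
        theta = [normalise(counts[0]), normalise(counts[1])]
        p1 = (sum(resp) + 1.0) / (len(docs) + 2.0)  # smoothed class prior
        prior = [1.0 - p1, p1]

    return resp


# Toy usage: seed items stand in for a source site, fragments for a new site.
seeds = ["canon eos camera", "nikon digital camera", "sony camera lens"]
new_site = [
    "canon digital camera",
    "add to cart",
    "nikon camera lens",
    "free shipping on orders",
    "privacy policy",
]
scores = em_label_fragments(seeds, new_site)
candidates = [f for f, s in zip(new_site, scores) if s > 0.5]
```

In this toy setting, fragments whose soft score exceeds 0.5 would be treated as candidate training examples for a new wrapper. A model closer to the one described above would also have to account for layout format and for attributes never seen on the source site.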
