Conference: IASTED International Multi-conference on Applied Informatics

OLERA: OnLine Extraction Rule Analysis for Semi-structured Documents

Abstract

Information extraction (IE) from semi-structured Web documents plays an important role for a variety of information agents. Over the past decade, researchers have developed a rich family of generic IE techniques based on the supervised approach, which learns extraction rules from user-labelled training examples. However, annotating training data can be expensive when many data sources need to be extracted. In this article, we introduce annotation-free IE using pattern mining and string alignment techniques. We describe OLERA, a semi-supervised IE system that produces extraction rules by aligning similar contents of multiple input records and presents the result in a spreadsheet-like table. Users therefore do not need to annotate the input documents; they only need to specify the schema for the extracted data after the extraction pattern is discovered. A further advantage is that this approach works not only for multi-record Web pages (to which some unsupervised IE approaches are limited) but also for single-record Web pages.
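
To make the idea of aligning records into a spreadsheet-like table concrete, here is a minimal sketch. It is not OLERA's actual algorithm, which the abstract does not specify. It assumes records have already been split into tokens and uses a textbook Needleman-Wunsch-style global alignment with unit scores. The function name `align`, the token encoding, and the scoring values are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

Column = Tuple[Optional[str], Optional[str]]


def align(a: List[str], b: List[str]) -> List[Column]:
    """Globally align two token sequences; None marks a gap (an empty cell)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Assumed scoring: +1 for identical tokens, -1 for a mismatch.
        return 1 if a[i - 1] == b[j - 1] else -1

    # score[i][j] = best score for aligning a[:i] with b[:j]; each gap costs 1.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i
    for j in range(1, m + 1):
        score[0][j] = -j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two tokens
                score[i - 1][j] - 1,              # gap in b
                score[i][j - 1] - 1,              # gap in a
            )

    # Trace back from the bottom-right corner to recover the aligned columns.
    cols: List[Column] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            cols.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] - 1:
            cols.append((a[i - 1], None))
            i -= 1
        else:
            cols.append((None, b[j - 1]))
            j -= 1
    cols.reverse()
    return cols


# Two hypothetical records, abstracted to tags plus a generic TEXT token.
rec1 = ["<li>", "<b>", "TEXT", "</b>", "<i>", "TEXT", "</i>", "</li>"]
rec2 = ["<li>", "<b>", "TEXT", "</b>", "</li>"]

table = align(rec1, rec2)
# Print one row per record, with "-" standing in for an empty cell.
for row in zip(*table):
    print(" | ".join(tok if tok is not None else "-" for tok in row))
```

Each aligned column corresponds to one column of the table. Columns where one record has a gap suggest optional fields, and columns holding `TEXT` tokens are candidates for the data slots that a user would then name in the schema.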
