Conference: International Conference on Web-Age Information Management

WYSIWYE*: An Algebra for Expressing Spatial and Textual Rules for Information Extraction

Abstract

The visual layout of a webpage can provide valuable clues for certain types of Information Extraction (IE) tasks. In traditional rule-based IE frameworks, these layout cues are mapped to rules that operate on the HTML source of the webpages. In contrast, we have developed a framework in which the rules can be specified directly at the layout level. This has many advantages: the higher level of abstraction leads to simpler extraction rules that are largely independent of the source code of the page and are therefore more robust. It also enables the specification of new types of rules that are not otherwise possible. To the best of our knowledge, there is no general framework that allows declarative specification of information extraction rules based on spatial layout. Our framework complements traditional text-based rule frameworks and allows spatial-layout-based rules to be combined seamlessly with traditional text-based rules. We describe the algebra that enables such a system and its efficient implementation using the standard relational and text-indexing features of a relational database. We demonstrate the simplicity and efficiency of this system on a task involving the extraction of software system requirements from software product pages.
