Home > Foreign-language conferences > Knowledge-Based Systems for Safety Critical Applications > Extracting structured data from Web pages (Poster)

Extracting structured data from Web pages (Poster)

Abstract

Many Web sites contain large collections of "structured" Web pages. These pages encode data from an underlying structured source and are typically generated dynamically. Our goal is to extract structured data from such a collection of pages automatically, without human input such as manually written rules or training sets. Extracting structured data gives us greater querying power over the data and is useful in information integration systems. Our approach consists of two stages. In the first stage, we deduce the unknown template used to create the pages. In the second stage, we use the deduced template to extract the values. We focus on the first stage because it is the more challenging of the two. The full version contains a formal definition of high occurrence correlation and our algorithm. We evaluated our approach on 9 real collections of pages.
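
The two stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose formal definition appears only in the full version. For illustration it assumes that a token belongs to the template if it occurs equally often in every page, and that template tokens appear in the same order on every page. The names `tokenize`, `deduce_template`, `extract_values`, and the sample `PAGES` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample collection: three pages generated from one template.
PAGES = [
    "<html><b>Title:</b> Dune <b>Price:</b> $9</html>",
    "<html><b>Title:</b> Jane Eyre <b>Price:</b> $7</html>",
    "<html><b>Title:</b> Emma <b>Price:</b> $5</html>",
]


def tokenize(page):
    """Split a page into HTML tags and whitespace-delimited words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def deduce_template(pages):
    """Stage 1 (simplified assumption): a token is part of the template
    if it occurs the same number of times in every page."""
    counts = [Counter(tokenize(p)) for p in pages]
    first = tokenize(pages[0])
    return [t for t in first if all(c[t] == counts[0][t] for c in counts)]


def extract_values(page, template):
    """Stage 2: walk the page alongside the template and collect the
    runs of tokens that fall between consecutive template tokens."""
    values, current = [], []
    remaining = iter(template)
    expected = next(remaining, None)
    for tok in tokenize(page):
        if tok == expected:
            if current:
                values.append(" ".join(current))
                current = []
            expected = next(remaining, None)
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


template = deduce_template(PAGES)
records = [extract_values(p, template) for p in PAGES]
print(records)
```

A data value that happens to repeat identically on every page would be mistaken for template text under this simplification, which hints at why the first stage is the harder one.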