Journal: International Journal of Artificial Intelligence Tools: Architectures, Languages, Algorithms

FIRST-ORDER LOGIC RULE INDUCTION FOR INFORMATION EXTRACTION IN WEB RESOURCES


Abstract

Information extraction from web pages, commonly known as screen scraping, is usually performed through wrapper induction, a technique based on the internal structure of HTML documents. The main limitation of such techniques is that a generated wrapper is only useful for the web page it was designed for. To overcome this, this paper proposes a system that generates first-order logic rules for extracting data from web pages. These rules are based on visual features such as font size, element positioning, or content type. They therefore do not depend on a document's internal structure and can work across different sites. The system has been validated on a set of different web pages, showing very high precision and good recall, which supports the robustness and generalization capabilities of the approach.
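
To make the idea of rules over visual features concrete, the following is a minimal illustrative sketch, not the system described in the paper. It hand-writes two hypothetical rules (the paper induces its rules automatically, and its actual predicates are not given here) and applies them to a toy page. The `Element` type, its fields, and the rules `is_title` and `is_price` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A rendered page element described only by visual features (hypothetical schema)."""
    text: str
    font_size: float   # rendered font size in pixels
    top: float         # vertical offset from the top of the page in pixels
    content_type: str  # coarse content class, e.g. "text" or "currency"


def is_title(e, page):
    # Hypothetical rule: title(E) :- forall F != E, font_size(E) > font_size(F)
    return all(e.font_size > f.font_size for f in page if f is not e)


def is_price(e, page):
    # Hypothetical rule: price(E) :- content_type(E, currency), title(T), below(E, T)
    return e.content_type == "currency" and any(
        is_title(t, page) and e.top > t.top for t in page
    )


def extract(page):
    """Apply both rules to every element and collect the matching texts."""
    return {
        "title": [e.text for e in page if is_title(e, page)],
        "price": [e.text for e in page if is_price(e, page)],
    }


# Toy page: no tag names or DOM paths are used, only visual features.
page = [
    Element("Acme Anvil", 24.0, 40.0, "text"),
    Element("Free shipping on all orders", 12.0, 80.0, "text"),
    Element("$19.99", 16.0, 120.0, "currency"),
]
result = extract(page)
print(result)
```

Because the rules refer only to font size, position, and content type, the same rules could in principle be applied to a page with entirely different markup, which is the property the abstract contrasts with wrapper induction.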
