
Boosted Wrapper Induction

Abstract

Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, that is, learning simple extraction procedures ("wrappers") for highly structured text such as Web pages produced by CGI scripts. For suitably regular domains, existing wrapper induction algorithms can efficiently learn wrappers that are simple and highly accurate, but the regularity bias of these algorithms makes them unsuitable for most conventional information extraction tasks. Boosting is a technique for improving the performance of a simple machine learning algorithm by repeatedly applying it to the training set with different example weightings. We describe an algorithm that learns simple, low-coverage wrapper-like extraction patterns, which we then apply to conventional information extraction problems using boosting. The result is BWI, a trainable information extraction system with a strong precision bias and F1 performance better than state-of-the-art techniques in many domains.
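
The abstract's one-sentence description of boosting (reapplying a simple learner to the training set under changing example weights) can be illustrated with a minimal sketch. This is a generic AdaBoost-style loop over one-dimensional threshold rules, not the BWI algorithm itself; the function names `train_stump`, `adaboost`, and `predict`, and the toy numeric setting, are assumptions made purely for illustration.

```python
import math


def stump_predict(x, threshold, polarity):
    """A very simple rule: predict `polarity` at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best threshold rule."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w
                for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Reapply the simple learner `rounds` times, reweighting examples each round."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this rule
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all learned rules; labels are +1 or -1."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

On the toy data `xs = [1, 2, 3]`, `ys = [1, -1, 1]`, no single threshold rule labels all three points correctly, but the weighted vote of three boosted rules does. BWI applies the same general idea to extraction patterns rather than numeric thresholds.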
