Journal: Data & Knowledge Engineering

Information extraction from structured documents using k-testable tree automaton inference


Abstract

Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses learning techniques that are based on strings, such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to exploit that structure is to induce tree automata, which are like finite state automata but parse trees instead of strings. In this work, we explore induction of k-testable ranked tree automata from a small set of annotated examples. We describe three variants, which differ in the way they generalize the inferred automaton. Experimental results on a set of benchmark data sets show that our approach compares favorably to string-based approaches. However, the quality of the extraction is still suboptimal.
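To make the idea of k-testable tree inference concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the abstract gives neither the exact definitions nor the three generalization variants, and it does not say how the annotated extraction targets are encoded. The sketch uses one common convention for k-testable tree languages: a tree is summarized by its root truncated to k-1 levels, its "forks" (every subtree with at least k levels, truncated to k levels), and its small subtrees (those with fewer than k levels). A new tree is accepted if all of its pieces were seen in the examples. The names `truncate`, `profile`, `learn`, and `accepts`, and the nested-tuple tree encoding, are hypothetical choices made for illustration.

```python
# Trees are nested tuples: (label, child, child, ...). A leaf is (label,).
# Assumed convention (not taken from the paper): a single node has depth 0.

def depth(t):
    """Number of edges on the longest path from the root of t to a leaf."""
    return 0 if len(t) == 1 else 1 + max(depth(c) for c in t[1:])

def truncate(t, levels):
    """Keep only the top `levels` levels of t (levels >= 1)."""
    if levels == 1:
        return (t[0],)
    return (t[0],) + tuple(truncate(c, levels - 1) for c in t[1:])

def subtrees(t):
    """Yield t and every subtree below it."""
    yield t
    for c in t[1:]:
        yield from subtrees(c)

def profile(t, k):
    """Return the (root, forks, small subtrees) summary of t for k >= 2."""
    root = truncate(t, k - 1)
    forks = {truncate(s, k) for s in subtrees(t) if depth(s) >= k - 1}
    small = {s for s in subtrees(t) if depth(s) <= k - 2}
    return root, forks, small

def learn(examples, k):
    """Collect the pieces seen in all example trees."""
    roots, forks, small = set(), set(), set()
    for t in examples:
        r, f, s = profile(t, k)
        roots.add(r)
        forks |= f
        small |= s
    return roots, forks, small

def accepts(model, t, k):
    """Accept t if every one of its pieces occurs in the learned model."""
    roots, forks, small = model
    r, f, s = profile(t, k)
    return r in roots and f <= forks and s <= small

# One example document fragment: a div nested inside a div, holding a paragraph.
example = ("div", ("div", ("p",)))
deeper = ("div", ("div", ("div", ("p",))))

model2 = learn([example], k=2)
model3 = learn([example], k=3)
print(accepts(model2, deeper, k=2))  # k=2 generalizes to deeper nesting
print(accepts(model3, deeper, k=3))  # k=3 is stricter
```

In this sketch a smaller k generalizes more aggressively, because only small local patterns have to match, while a larger k stays closer to the training trees. How the paper's three variants control that trade-off is not stated in the abstract.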
