Journal: IEEE Transactions on Knowledge and Data Engineering

Semisupervised Wrapper Choice and Generation for Print-Oriented Documents

Abstract

Information extraction from printed documents is still a crucial problem in many interorganizational workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario well, as printed documents do not carry any explicit structural or syntactical description. Moreover, printed documents usually lack any explicit indication about their source. We present a system, which we call PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists, and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of normal system operation: new wrappers are generated online based on a few point-and-click operations performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very good performance on a challenging data set composed of more than 600 printed documents drawn from three different application domains: invoices, datasheets of electronic components, and patents. We also perform an extensive analysis of the crucial tradeoff between accuracy and automation level.
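
The three-way decision described in the abstract (use an existing wrapper, ask the operator, or generate a new wrapper) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from PATO: the `Wrapper` class, the keyword-overlap `score` function, and the `accept` and `reject` thresholds are all hypothetical, and the abstract does not say how PATO actually scores or selects wrappers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Wrapper:
    """Hypothetical source-specific wrapper: a source name plus
    keywords expected to appear in documents from that source."""
    source: str
    keywords: frozenset


def score(wrapper: Wrapper, tokens: List[str]) -> float:
    """Fraction of the wrapper's keywords found in the document.
    An illustrative similarity measure, not the one used by PATO."""
    if not wrapper.keywords:
        return 0.0
    found = wrapper.keywords & set(tokens)
    return len(found) / len(wrapper.keywords)


def choose_wrapper(
    wrappers: List[Wrapper],
    tokens: List[str],
    accept: float = 0.8,   # assumed threshold: at or above, no operator needed
    reject: float = 0.3,   # assumed threshold: at or below, no wrapper fits
) -> Tuple[str, Optional[Wrapper]]:
    """Return (decision, wrapper), where decision is one of:
    "auto"     - the best wrapper is applied without operator involvement
    "confirm"  - the operator is asked to confirm the best candidate
    "generate" - no suitable wrapper exists; a new one must be generated
    """
    best = max(wrappers, key=lambda w: score(w, tokens), default=None)
    best_score = score(best, tokens) if best is not None else 0.0
    if best_score >= accept:
        return "auto", best
    if best_score > reject:
        return "confirm", best
    return "generate", None


# Two made-up sources and three made-up documents (as token lists).
wrappers = [
    Wrapper("acme", frozenset({"acme", "invoice", "vat"})),
    Wrapper("globex", frozenset({"globex", "datasheet", "voltage", "pin"})),
]
known = choose_wrapper(wrappers, ["acme", "invoice", "vat", "total"])
unsure = choose_wrapper(wrappers, ["globex", "datasheet"])
unseen = choose_wrapper(wrappers, ["initech", "patent", "claims"])
```

In this sketch, raising `accept` sends more documents to the operator and lowering it automates more decisions, which is one simple way to picture the tradeoff between accuracy and automation level that the abstract mentions.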
