Conference proceedings: Document recognition and retrieval XVII

Using definite clause grammars to build a global system for analyzing collections of documents

Abstract

Collections of documents are sets of heterogeneous documents, such as a specific series of ancient books, with their own structural and semantic properties linking them. A particular collection contains document images with specific physical layouts, such as text pages or full-page illustrations, appearing in a specific order. Its contents, such as journal articles, may span several pages that are not necessarily consecutive, producing strong dependencies between page interpretations. In order to build an analysis system that can bring contextual information from the collection to the appropriate recognition modules for each page, we propose to express the structural and semantic properties of a collection with a definite clause grammar. This is made possible by representing collections as streams of document images, and by using the extensions to the formalism that we present here. We are then able to automatically generate a parser dedicated to a collection. Besides allowing structural variations and complex information flows, we also show that this approach enables the design of analysis stages on a document or a set of documents. The benefit of using context is illustrated with several examples and their formalization in this framework.
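To make the idea of parsing a collection as a stream of page images more concrete, here is a minimal sketch in Python. It is not the authors' system or their grammar formalism: the page labels (`cover`, `title`, `text`, `illustration`), the two rules, and the function names are all hypothetical, and each page is assumed to be already reduced to a `(layout, payload)` pair. Each rule takes the stream and a position and returns the next position, mimicking how a definite clause grammar threads the remaining input through its rules.

```python
from typing import List, Optional, Tuple

# Hypothetical page representation: (layout label, payload).
# A real system would derive the layout from the page image.
Page = Tuple[str, str]
Result = Optional[Tuple[int, list]]


def cover(stream: List[Page], i: int) -> Result:
    """cover --> [cover_page]. Accept exactly one cover page."""
    if i < len(stream) and stream[i][0] == "cover":
        return i + 1, []
    return None


def article(stream: List[Page], i: int) -> Result:
    """article --> [title_page], ([text_page] | [illustration])*.

    The title read on the first page is kept as context for the
    following text pages, even when an illustration interrupts them.
    """
    if i >= len(stream) or stream[i][0] != "title":
        return None
    title = stream[i][1]
    pages = [i]
    i += 1
    while i < len(stream) and stream[i][0] in ("text", "illustration"):
        if stream[i][0] == "text":
            pages.append(i)  # text page attached to the current article
        i += 1
    return i, [{"title": title, "pages": pages}]


def collection(stream: List[Page]) -> Optional[list]:
    """collection --> cover, article*. The whole stream must be consumed."""
    res = cover(stream, 0)
    if res is None:
        return None
    i, articles = res
    while (res := article(stream, i)) is not None:
        i, found = res
        articles.extend(found)
    return articles if i == len(stream) else None


stream = [
    ("cover", ""),
    ("title", "On Grammars"),
    ("text", "body"),
    ("illustration", ""),
    ("text", "body"),
    ("title", "On Streams"),
    ("text", "body"),
]
parsed = collection(stream)
print(parsed)
```

In this sketch the article "On Grammars" is assigned pages 1, 2 and 4, with the full-page illustration at index 3 skipped, which illustrates content spread over non-consecutive pages. A stream that does not start with a cover page is rejected as a whole.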