Journal: Procedia Computer Science

Semantic PDF Segmentation for Legacy Documents in Technical Documentation


Abstract

The most common format for storing and providing technical documentation is PDF. However, because the format is unstructured, these documents are often excluded from granular semantic access. While more and more companies are implementing XML-based component content management systems that can deliver annotated, structured content, older legacy documents remain in their monolithic form. We developed a new approach which segments PDF documents into semantically related sections using classification knowledge gained from structured training content. This machine-learning-based approach is independent of any formatting information or visual cues. In this paper, we take the results from multiple previous works and combine them into a holistic procedure model. We introduce a parameterizable range-finding algorithm to refine segment detection and provide an RDF-based format for exchanging the generated metadata, which can then be used to improve information retrieval for users.
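The abstract does not spell out the range-finding algorithm, so the following is only an illustrative sketch of what a parameterizable range finder over classifier output could look like, not the authors' method. It assumes a classifier has already assigned one semantic label to each page; the names `Segment`, `find_ranges`, `max_gap`, and `min_length`, as well as the example labels, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    label: str
    start: int  # first page index (inclusive)
    end: int    # last page index (inclusive)


def find_ranges(labels: List[str], max_gap: int = 1, min_length: int = 2) -> List[Segment]:
    """Group per-page labels into segments (hypothetical sketch).

    max_gap:    number of differently labelled pages tolerated inside a segment
    min_length: segments spanning fewer pages than this are discarded
    """
    segments: List[Segment] = []
    last_open: Dict[str, Segment] = {}  # most recent segment per label

    for page, label in enumerate(labels):
        seg = last_open.get(label)
        if seg is not None and page - seg.end - 1 <= max_gap:
            # Close enough to the previous page with this label: extend it.
            seg.end = page
        else:
            # Otherwise start a new segment for this label.
            seg = Segment(label, page, page)
            segments.append(seg)
            last_open[label] = seg

    # Drop short segments; note that bridged segments may overlap others.
    kept = [s for s in segments if s.end - s.start + 1 >= min_length]
    return sorted(kept, key=lambda s: s.start)


pages = ["safety", "safety", "maintenance", "safety",
         "operation", "operation", "operation"]
tolerant = find_ranges(pages, max_gap=1, min_length=2)
strict = find_ranges(pages, max_gap=0, min_length=2)
```

With `max_gap=1` the single "maintenance" page is treated as noise and bridged, giving a "safety" segment over pages 0 to 3; with `max_gap=0` the "safety" segment ends at page 1. Tuning these two parameters trades robustness against classifier errors for sensitivity to short sections.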