Conference: International Conference on Knowledge Discovery and Information Retrieval

Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation

Abstract

This paper presents an approach to text processing of PDF documents that have a well-defined layout structure. The approach exploits the font structure of PDF documents using perceptual grouping. It extracts text objects from the content stream of each document and groups them according to a set criterion, also using geometry-based regions to recover the correct reading order. The approach then applies logical and structural rules to extract the entities present in the documents and returns an optimized XML representation of each PDF document, suitable for reuse in tasks such as text categorization. The system was trained and tested on Portuguese legislation PDF documents taken from the electronic Republic's Diary (Diário da República Electrónico). Evaluation results show that the approach performs well on these documents.
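The pipeline the abstract outlines (extract text objects, order them by geometric region, group them by font, emit XML) can be illustrated with a minimal Python sketch. The abstract does not state the actual grouping criterion, region rules, or XML schema, so everything here is an assumption made for illustration: the `TextObject` record, the two-column split at a fixed x coordinate, grouping on font name and size, the `heading`/`paragraph` tags, and the sample data. Reading the PDF content stream itself is out of scope; the sketch starts from already-extracted text runs.

```python
from dataclasses import dataclass
from itertools import groupby
import xml.etree.ElementTree as ET


@dataclass
class TextObject:
    """One text run as it might come out of a PDF content stream (hypothetical)."""
    text: str
    font: str
    size: float
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points


def reading_order(objects, column_split):
    """Order by geometric region: left column first, then top to bottom."""
    return sorted(objects, key=lambda o: (o.x >= column_split, o.y, o.x))


def group_by_font(objects):
    """Merge consecutive runs that share font name and size into one block."""
    blocks = []
    for (font, size), run in groupby(objects, key=lambda o: (o.font, o.size)):
        blocks.append((font, size, " ".join(o.text for o in run)))
    return blocks


def to_xml(blocks, heading_size):
    """Label each block with a simple font-size rule and serialise to XML."""
    root = ET.Element("document")
    for font, size, text in blocks:
        tag = "heading" if size >= heading_size else "paragraph"
        node = ET.SubElement(root, tag, font=font, size=str(size))
        node.text = text
    return ET.tostring(root, encoding="unicode")


# Made-up sample page, listed out of reading order on purpose.
page = [
    TextObject("right column text", "Times", 9.0, 320, 100),
    TextObject("Decreto-Lei", "Times-Bold", 12.0, 40, 80),
    TextObject("first line", "Times", 9.0, 40, 100),
    TextObject("second line", "Times", 9.0, 40, 112),
]

ordered = reading_order(page, column_split=300)
blocks = group_by_font(ordered)
xml_out = to_xml(blocks, heading_size=11.0)
print(xml_out)
```

Sorting before grouping matters in this sketch: `groupby` only merges adjacent items, so runs must already be in reading order for same-font text to end up in one block. A real system would likely derive the column boundaries and font rules from the documents rather than hard-code them.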