Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation

Abstract

This paper presents an approach for processing the text of PDF documents that have a well-defined layout structure. The aim of the approach is to explore the font structure of PDF documents using perceptual grouping. It consists of extracting text objects from the content stream of each document and grouping them according to a set criterion, while also using geometry-based regions to recover the correct reading order. The approach applies logical and structural rules to extract the entities present in the documents and returns an optimized XML representation of each PDF document that can be reused, for example in text categorization. The system was trained and tested with PDF documents of Portuguese legislation extracted from the electronic Republic's Diary (Diário da República). The evaluation reports good results for the approach.
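The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TextObject` fields, the two-column split, the "same font name and size" grouping criterion, the heading-size threshold, and the XML element names are all assumptions made for illustration, and the sample text objects are invented. The sketch orders text objects by geometric region, merges consecutive objects that share a font, and emits a small XML tree.

```python
from dataclasses import dataclass
from itertools import groupby
import xml.etree.ElementTree as ET


@dataclass
class TextObject:
    """Hypothetical stand-in for a text object read from a PDF content stream."""
    text: str
    font: str    # font name
    size: float  # font size in points
    x: float     # left edge of the object
    y: float     # baseline, measured from the bottom of the page


def reading_order(objects, column_split):
    """Sort by column region first, then top to bottom, then left to right."""
    def key(o):
        column = 0 if o.x < column_split else 1
        return (column, -o.y, o.x)
    return sorted(objects, key=key)


def group_by_font(ordered):
    """Merge consecutive objects that share font name and size (assumed criterion)."""
    blocks = []
    for (font, size), run in groupby(ordered, key=lambda o: (o.font, o.size)):
        blocks.append((font, size, " ".join(o.text for o in run)))
    return blocks


def to_xml(blocks, heading_size):
    """Blocks at or above heading_size become <title>, the rest <paragraph>."""
    root = ET.Element("document")
    for font, size, text in blocks:
        tag = "title" if size >= heading_size else "paragraph"
        node = ET.SubElement(root, tag, font=font)
        node.text = text
    return ET.tostring(root, encoding="unicode")


# Invented sample data: a heading and body text spread over two columns,
# listed out of reading order on purpose.
objects = [
    TextObject("of the second column.", "Times-Roman", 9, 320, 700),
    TextObject("Sample heading", "Times-Bold", 12, 50, 720),
    TextObject("Body text", "Times-Roman", 9, 50, 700),
    TextObject("continues here.", "Times-Roman", 9, 50, 688),
    TextObject("Start", "Times-Roman", 9, 320, 712),
]

blocks = group_by_font(reading_order(objects, column_split=300))
xml_output = to_xml(blocks, heading_size=12)
print(xml_output)
```

Sorting by region before grouping matters here: grouping by font alone would interleave lines from the two columns, whereas ordering first lets a paragraph that runs across a column break be merged into a single block.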

Bibliographic record

Source
International Conference on Knowledge Discovery and Information Retrieval | 2012 | 9 pages
Authors
Nuno Moniz; Fatima Rodrigues
Original format: PDF
Chinese Library Classification: G354-53
Keywords
Information Retrieval; Text Extraction; PDF

Similar literature

1. An innovative hybrid approach for extracting named entities from unstructured text data [J]. Thomas Anu, Sangeetha S. Computational Intelligence. 2019, No. 4
2. Robust and Secure Data Hiding for PDF Text Document [J]. Minoru KURIBAYASHI, Takuya FUKUSHIMA, Nobuo FUNABIKI. IEICE transactions on information and systems. 2019, No. 1
3. Digitization of Text Documents Using PDF/A [J]. Yan Han, Xueheng Wan. Information technology and libraries. 2018, No. 1
4. Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation [C]. Nuno Moniz, Fatima Rodrigues. International Conference on Knowledge Discovery and Information Retrieval. 2012
5. Extracting the structure and conformations of biological entities from large datasets [D]. Dashti, Ali. 2013
6. Desktop document delivery using portable document format (PDF) files and the Web [O]. J P Shipman, W L Gembala, J M Reeder. 1998
7. Extracting Body Text from Academic PDF Documents for Text Mining [O]. Changfeng Yu, Cheng Zhang, Jie Wang. 2020
