Journal: Computer and information science

Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation

Abstract

Information can include text, pictures and signatures that are scanned into a document format such as the Portable Document Format (PDF) and easily emailed to recipients around the world. On arrival, the receiver can open and view the document in any of a wide range of PDF viewing applications, such as Adobe Reader and Apple Preview. Hence, the use of PDF has become pervasive. Because a scanned PDF is an image format, it is inaccessible to assistive technologies such as screen readers, so retrieving its information requires Optical Character Recognition (OCR). OCR software processes the scanned PDF file and, through text extraction, generates an editable text document. This text document can then be edited, formatted, searched and indexed, as well as translated or converted to speech. A problem that OCR software does not solve is the accurate regeneration of the full text layout. This paper presents a technology that addresses this issue by closely preserving the original textual layout of the scanned PDF, using the open source document analysis and OCR system OCRopus, based on geometric layout and positioning information. The main issues considered in this research are the preservation of the correct reading order and the representation of common logical structure elements such as section headings, line breaks, paragraphs, captions, sidebars, foot-bars, running headers, embedded images, graphics, tables and mathematical expressions.
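To make the idea of recovering reading order from geometric positioning information concrete, here is a minimal Python sketch. It is not the paper's method and does not use the OCRopus API; the `Block` type and `reading_order` function are hypothetical names introduced only for illustration. It assumes each text block already has a bounding box in page coordinates with the origin at the top left and y increasing downward, and it assumes a simple column layout with no full-width blocks.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: bounding box plus recognized text."""
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge
    text: str


def reading_order(blocks):
    """Order blocks column by column (left to right), top to bottom inside each column.

    Blocks whose horizontal extents overlap are treated as one column.
    This is an illustrative heuristic, not the algorithm from the paper.
    """
    columns = []  # each entry: [col_x0, col_x1, list_of_blocks]
    for b in sorted(blocks, key=lambda blk: blk.x0):
        for col in columns:
            # Horizontal overlap with an existing column: join it.
            if b.x0 < col[1] and b.x1 > col[0]:
                col[0] = min(col[0], b.x0)
                col[1] = max(col[1], b.x1)
                col[2].append(b)
                break
        else:
            # No overlapping column found: start a new one.
            columns.append([b.x0, b.x1, [b]])

    ordered = []
    for col in sorted(columns, key=lambda c: c[0]):
        ordered.extend(sorted(col[2], key=lambda blk: blk.y0))
    return ordered


# Usage: a two-column page whose blocks arrive in arbitrary order.
page = [
    Block(320, 320, 550, 500, "right column, second paragraph"),
    Block(50, 100, 280, 300, "left column, first paragraph"),
    Block(320, 100, 550, 300, "right column, first paragraph"),
    Block(50, 320, 280, 500, "left column, second paragraph"),
]
for block in reading_order(page):
    print(block.text)
```

A heuristic this simple breaks down as soon as a block spans several columns (a full-width heading, table or figure would merge the columns it overlaps), which is one reason the logical elements listed in the abstract need dedicated handling.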
