Conference: Natural Language Processing and Chinese Computing

Integration of Text Information and Graphic Composite for PDF Document Analysis

Abstract

The trend toward large-scale digitization has strongly motivated research on processing PDF documents that carry little structure information. Challenging problems such as graphic segmentation integrated with text remain unsolved, which hinders the practical application of PDF layout analysis. To cope with such documents, a hybrid method incorporating text information and graphic composite is proposed to segment pages that are difficult for traditional methods to handle. Specifically, the text information is derived accurately from born-digital documents, in which low-level structure elements are embedded in explicit form. The page text elements are then clustered with a graph-based method according to proximity and feature similarity. Meanwhile, the graphic components are extracted by means of texture and morphological analysis. By integrating the clustered text elements with the image-based graphic components, the graphics are segmented for layout analysis. Experimental results on pages of PDF books show satisfactory performance.
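
The graph-based clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the feature set, the thresholds, or the exact grouping rule, so everything below is an assumption made for illustration: the hypothetical `TextElement` record, font size as the only similarity feature, the `max_gap` and `max_font_diff` thresholds, and the use of connected components as the clustering rule. It is not the authors' implementation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TextElement:
    """Hypothetical text run: bounding box in page units plus font size."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float


def box_gap(a: TextElement, b: TextElement) -> float:
    """Euclidean gap between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def cluster_text_elements(elements, max_gap=12.0, max_font_diff=1.0):
    """Return the connected components of a proximity/similarity graph.

    Two elements share an edge when their boxes are at most `max_gap`
    apart and their font sizes differ by at most `max_font_diff`.
    Both thresholds are illustrative assumptions.
    """
    parent = list(range(len(elements)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(elements)), 2):
        a, b = elements[i], elements[j]
        close = box_gap(a, b) <= max_gap
        similar = abs(a.font_size - b.font_size) <= max_font_diff
        if close and similar:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i, element in enumerate(elements):
        clusters.setdefault(find(i), []).append(element)
    return list(clusters.values())


# Made-up page content: three small labels near a figure and one distant heading.
elements = [
    TextElement("Figure 1.", 50, 100, 100, 110, 9.0),
    TextElement("System overview", 104, 100, 190, 110, 9.0),
    TextElement("1 Introduction", 50, 300, 160, 316, 14.0),
    TextElement("axis label", 60, 118, 110, 127, 9.0),
]
clusters = cluster_text_elements(elements)
for cluster in clusters:
    print([element.text for element in cluster])
```

In this sketch the three small labels end up in one cluster and the heading stays on its own. A cluster of short, small-font labels like this is the kind of text group that could then be matched against an image-based graphic component, which is the integration step the abstract describes; that step is not shown here.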

Bibliographic details

  • Conference location: Beijing (CN)
  • Author affiliations

    Institute of Computer Science and Technology, Peking University, Beijing, China;

    Postdoctoral Workstation of the Zhongguancun Haidian Science Park and Peking University Founder Group Co. Ltd, Beijing, China;

    State Key Laboratory of Digital Publishing Technology, Beijing, China;

    Institute of Computer Science and Technology, Peking University, Beijing, China;

    State Key Laboratory of Digital Publishing Technology, Beijing, China;

    Institute of Computer Science and Technology, Peking University, Beijing, China;

    State Key Laboratory of Digital Publishing Technology, Beijing, China;

    Institute of Computer Science and Technology, Peking University, Beijing, China;

  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Mathematical linguistics
  • Keywords

    PDF document; graphic segmentation; graph based method; text clustering;
