DOCUMENT STRUCTURE EXTRACTING DEVICE AND DOCUMENT STRUCTURE INFORMATION EXTRACTING METHOD

Abstract

PROBLEM TO BE SOLVED: To provide a document structure extracting device capable of extracting a document structure from an electronic document without using a dictionary. SOLUTION: The document structure extracting device, which extracts document structure information from an electronic document, comprises a character information generating part 103, a line information generating part 105, and a document structure information generating part 107. The character information generating part 103 analyzes the document and generates character information containing the position, character size, and character type of each character. The line information generating part 105 analyzes the character information and generates line information containing the character string, the main character size, the main character type, and the score of each line. The document structure information generating part 107 analyzes the line information and generates the document structure information by grouping lines on the basis of their scores and the continuity of the lines. The document structure information can thus be extracted from the electronic document without using a dictionary.
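The three stages described in the abstract (characters, then lines, then grouped structure) can be illustrated with a minimal Python sketch. The abstract does not say how the score is computed or how continuity is judged, so everything concrete here is an assumption made for illustration: the names `Char`, `Line`, `build_lines`, `score_lines`, and `group_lines` are hypothetical, the score is taken to be the line's main character size relative to the most common size plus a bonus for bold type, and "continuity" is approximated as adjacent lines that fall on the same side of a threshold.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float
    y: float          # assumed to increase down the page
    size: float
    kind: str         # character type, e.g. "regular" or "bold" (assumed values)


@dataclass
class Line:
    text: str
    main_size: float
    main_kind: str
    score: float = 0.0


def build_lines(chars):
    """Stage 2a: collect characters sharing a y position into lines and
    record each line's dominant (main) character size and type."""
    rows = {}
    for ch in chars:
        rows.setdefault(ch.y, []).append(ch)
    lines = []
    for y in sorted(rows):
        row = sorted(rows[y], key=lambda c: c.x)
        main_size = Counter(c.size for c in row).most_common(1)[0][0]
        main_kind = Counter(c.kind for c in row).most_common(1)[0][0]
        lines.append(Line("".join(c.text for c in row), main_size, main_kind))
    return lines


def score_lines(lines):
    """Stage 2b: assign a score per line. The formula is an assumption:
    size relative to the most common line size, plus 0.5 for bold."""
    body_size = Counter(l.main_size for l in lines).most_common(1)[0][0]
    for l in lines:
        bonus = 0.5 if l.main_kind == "bold" else 0.0
        l.score = l.main_size / body_size + bonus
    return lines


def group_lines(lines, heading_threshold=1.2):
    """Stage 3: group adjacent lines whose scores fall on the same side of
    the (assumed) threshold, yielding heading and body blocks."""
    blocks = []
    for l in lines:
        role = "heading" if l.score >= heading_threshold else "body"
        if blocks and blocks[-1]["role"] == role:
            blocks[-1]["lines"].append(l.text)
        else:
            blocks.append({"role": role, "lines": [l.text]})
    return blocks


def chars_for(text, y, size, kind="regular"):
    """Helper that fabricates character records for a line of text."""
    return [Char(c, float(i), y, size, kind) for i, c in enumerate(text)]


if __name__ == "__main__":
    chars = (
        chars_for("Intro", 0, 16, "bold")
        + chars_for("First line.", 1, 10)
        + chars_for("Second line.", 2, 10)
        + chars_for("Methods", 3, 16, "bold")
        + chars_for("Third line.", 4, 10)
    )
    for block in group_lines(score_lines(build_lines(chars))):
        print(block["role"], block["lines"])
```

No word list or dictionary lookup appears anywhere in the sketch; only layout attributes (position, size, type) drive the grouping, which is the property the abstract emphasizes.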