METHODS AND SYSTEMS FOR EFFICIENT AND ACCURATE TEXT EXTRACTION FROM UNSTRUCTURED DOCUMENTS

Abstract

According to one aspect, the subject matter described herein includes a method for extracting text from unstructured documents. The method includes creating a spatial index for storing information about words on a page of a document to be analyzed; using the spatial index to detect white space that indicates boundaries of columns within the page, to aggregate words into lines, to identify lines that are part of a header or footer of the page, and to identify lines that are part of a table or a figure within the page; and joining lines together to generate continuous text flows. In one embodiment, the continuous text is divided into sections. In one embodiment, references within the document are identified. In one embodiment, inline citations within the document body are replaced with the corresponding reference information, or portions thereof.
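
The abstract does not say how the spatial index is built or queried, so the following is only a minimal sketch under stated assumptions, not the patented implementation. It assumes a uniform grid over word bounding boxes, a page coordinate system with the origin at the top left, and an empty-vertical-strip test for column boundaries. The names `Word`, `GridIndex`, `find_column_gaps`, and `words_to_lines`, and all thresholds, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word and its bounding box (hypothetical record layout)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class GridIndex:
    """Uniform-grid spatial index over word boxes (an assumed structure)."""

    def __init__(self, cell=50.0):
        self.cell = cell
        self.cells = defaultdict(list)

    def _keys(self, x0, y0, x1, y1):
        # Enumerate every grid cell touched by the rectangle.
        c = self.cell
        for gx in range(int(x0 // c), int(x1 // c) + 1):
            for gy in range(int(y0 // c), int(y1 // c) + 1):
                yield gx, gy

    def insert(self, w):
        for key in self._keys(w.x0, w.y0, w.x1, w.y1):
            self.cells[key].append(w)

    def query(self, x0, y0, x1, y1):
        """Return the set of words whose boxes intersect the rectangle."""
        found = set()
        for key in self._keys(x0, y0, x1, y1):
            for w in self.cells.get(key, ()):
                if w.x0 <= x1 and w.x1 >= x0 and w.y0 <= y1 and w.y1 >= y0:
                    found.add(w)
        return found


def find_column_gaps(index, page_width, page_height, step=5.0, min_gap=15.0):
    """Return (start, end) x-ranges of word-free vertical strips >= min_gap wide.

    Page margins are reported too; callers can drop ranges touching the edges.
    """
    gaps, start, x = [], None, 0.0
    while x < page_width:
        empty = not index.query(x, 0.0, x + step, page_height)
        if empty and start is None:
            start = x
        elif not empty and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))
            start = None
        x += step
    if start is not None and page_width - start >= min_gap:
        gaps.append((start, page_width))
    return gaps


def words_to_lines(words, tol=3.0):
    """Group words whose top edges differ by at most tol; order left to right."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        if lines and abs(lines[-1][0].y0 - w.y0) <= tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0))
            for line in lines]


# Toy two-column page, 600 x 800 units (made-up coordinates).
index = GridIndex()
for word in [
    Word("Left", 50, 100, 90, 112), Word("column", 95, 100, 150, 112),
    Word("text", 50, 120, 80, 132),
    Word("Right", 320, 100, 360, 112), Word("side", 365, 100, 400, 112),
]:
    index.insert(word)

gaps = find_column_gaps(index, 600.0, 800.0)
# Keep only interior gaps; these separate the columns.
interior = [g for g in gaps if g[0] > 0.0 and g[1] < 600.0]
left_lines = words_to_lines(index.query(0.0, 0.0, interior[0][0], 800.0))
right_lines = words_to_lines(index.query(interior[0][1], 0.0, 600.0, 800.0))
```

A grid is used here only because it is short to write; an R-tree or interval tree would serve the same role. Header and footer detection, table and figure detection, and citation replacement are left out, since the abstract gives no detail on how they work.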

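Bibliographic data

  • Publication number: US2016314104A1

    Patent type

  • Publication date: 2016-10-27

    Original format: PDF

  • Applicant/assignee: SCIOME LLC

    Application number: US201615136674

  • Filing date: 2016-04-22

  • Classification: G06F17/22; G06F17/30; G06F17/27

  • Country: US

  • Added to database: 2022-08-21 14:38:51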
