METHODS AND SYSTEMS FOR EFFICIENT AND ACCURATE TEXT EXTRACTION FROM UNSTRUCTURED DOCUMENTS
Abstract
According to one aspect, the subject matter described herein includes a method for extracting text from unstructured documents. The method includes creating a spatial index for storing information about words on a page of a document to be analyzed; using the spatial index to detect white space that indicates boundaries of columns within the page, aggregate words into lines, identify lines that are part of a header or footer of the page, and identify lines that are part of a table or a figure within the page; and joining lines together to generate continuous text flows. In one embodiment, the continuous text is divided into sections. In one embodiment, references within the document are identified. In one embodiment, inline citations within the document body are replaced with the corresponding reference information, or portions thereof.
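The first steps of the abstract (spatial index, column detection from white space, line aggregation, joining lines into a flow) can be illustrated with a minimal sketch. It is not the patented implementation: the uniform-grid index, the names `Word`, `GridIndex`, `find_gutters`, `lines_in_column`, and `text_flow`, and the thresholds `cell`, `min_gap`, `step`, and `y_tol` are all assumptions chosen for illustration. Header/footer detection, table and figure detection, sectioning, and citation replacement are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word and its bounding box in page coordinates (hypothetical layout)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class GridIndex:
    """Uniform-grid spatial index: maps grid cells to the words overlapping them."""

    def __init__(self, cell=50.0):
        self.cell = cell  # assumed cell size, in page units
        self.cells = defaultdict(list)

    def _span(self, lo, hi):
        return range(int(lo // self.cell), int(hi // self.cell) + 1)

    def insert(self, w):
        for cx in self._span(w.x0, w.x1):
            for cy in self._span(w.y0, w.y1):
                self.cells[(cx, cy)].append(w)

    def query(self, x0, y0, x1, y1):
        """Return the set of words whose boxes intersect the query rectangle."""
        found = set()
        for cx in self._span(x0, x1):
            for cy in self._span(y0, y1):
                for w in self.cells.get((cx, cy), ()):
                    if w.x0 < x1 and w.x1 > x0 and w.y0 < y1 and w.y1 > y0:
                        found.add(w)
        return found


def find_gutters(index, page_w, page_h, min_gap=15.0, step=1.0):
    """Scan full-height vertical strips; runs of empty strips at least
    min_gap wide, between two occupied regions, are treated as column gaps."""
    gutters, start, x = [], None, 0.0
    while x < page_w:
        empty = not index.query(x, 0.0, x + step, page_h)
        if empty and start is None:
            start = x
        elif not empty and start is not None:
            # start > 0 skips the left margin; the right margin never closes.
            if start > 0.0 and x - start >= min_gap:
                gutters.append((start, x))
            start = None
        x += step
    return gutters


def lines_in_column(index, x0, x1, page_h, y_tol=3.0):
    """Group a column's words into lines by similar top coordinate."""
    words = sorted(index.query(x0, 0.0, x1, page_h), key=lambda w: (w.y0, w.x0))
    lines = []
    for w in words:
        if lines and abs(lines[-1][0].y0 - w.y0) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0))
            for line in lines]


def text_flow(words, page_w, page_h):
    """Index the words, split the page at gutter midpoints, and read the
    columns left to right, joining their lines into one continuous string."""
    index = GridIndex()
    for w in words:
        index.insert(w)
    mids = [(a + b) / 2 for a, b in find_gutters(index, page_w, page_h)]
    edges = [0.0] + mids + [page_w]
    lines = []
    for left, right in zip(edges, edges[1:]):
        lines.extend(lines_in_column(index, left, right, page_h))
    return " ".join(lines)
```

Calling `text_flow(words, page_w, page_h)` with word boxes from any PDF or OCR front end returns the page text in column reading order. The strip scan is deliberately simple; a production system would likely need to tolerate gutters that do not span the full page height.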