DOCUMENT INFORMATION EXTRACTION METHOD AND SYSTEM BASED ON BODY TEXT IDENTIFICATION
Abstract
A document information extraction method and system based on body text identification are provided. By excluding expressions that carry no main information from the title-substitute expression, they prevent strings unrelated to the user's intent from being displayed for a web document. A document is parsed (S110) and divided into sections by referring to the parsing information (S120). The body text section of the document is identified (S130) according to a predetermined criterion that includes at least one of: the ratio of text without a link attribute in each section, the proportion of the total document that the section occupies, the section size, and the section position information. The position of the body text content is then identified according to a predetermined criterion that includes at least one of: the positions of line breaks in the text content, and the text width.
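Steps S110 to S130 can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify how sections are delimited or how the criteria are combined, so the block-tag list `BLOCK_TAGS`, the threshold `min_ratio`, the scoring formula, and all class and function names here are assumptions chosen for illustration. The sketch uses only two of the listed criteria: the ratio of non-link text in each section and the share of the document the section occupies.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumption: these tags delimit "sections"; the abstract does not say which do.
BLOCK_TAGS = {"div", "td", "section", "article", "ul", "table", "p"}


@dataclass
class Section:
    index: int           # position of the section in document order
    text: str = ""       # whitespace-normalized text of the section
    link_chars: int = 0  # characters that appeared inside <a> elements

    @property
    def size(self) -> int:
        return len(self.text)

    @property
    def non_link_ratio(self) -> float:
        # Share of the section's text that carries no link attribute.
        return 1.0 - self.link_chars / self.size if self.size else 0.0


class SectionSplitter(HTMLParser):
    """Parses the document (S110) and divides it into sections (S120)."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._current = Section(index=0)
        self._link_depth = 0

    def flush(self):
        # Close the current section if it holds any text, then start a new one.
        if self._current.text.strip():
            self.sections.append(self._current)
        self._current = Section(index=len(self.sections))

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        self._current.text += text + " "
        if self._link_depth:
            self._current.link_chars += len(text) + 1


def split_sections(html):
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()   # make the parser emit any buffered text
    parser.flush()   # keep a trailing section that no tag closed
    return parser.sections


def find_body_section(html, min_ratio=0.5):
    """Picks the likely body text section (S130), or None if none qualifies."""
    sections = split_sections(html)
    total = sum(s.size for s in sections) or 1
    candidates = [s for s in sections if s.non_link_ratio >= min_ratio]
    if not candidates:
        return None
    # Hypothetical score: non-link ratio weighted by share of the document.
    return max(candidates, key=lambda s: s.non_link_ratio * s.size / total)
```

Calling `find_body_section(html)` on a page whose navigation blocks consist mostly of links should return the section holding the article text, since link-heavy sections fall below `min_ratio`. The other criteria named in the abstract (section size, section position, line-break positions, and text width) could be folded into the score in the same way, but their weighting is not given in the source.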