METHOD TO IDENTIFY COMMON STRUCTURES IN FORMATTED TEXT DOCUMENTS
Abstract
A computer-implemented method, computer program product, and data processing system identify common structures shared across a plurality of formatted text documents. The common structure is presented as a sequence of landmarks, each of which has a starting marker and an ending marker that describe the borders of a span of text. The common structure is identified by counting the occurrences of repeating text segments across documents. Adjacent segments that frequently co-occur become candidates for landmark markers. In addition, styling information for the textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which is an advantage over the current practice of content concatenation.
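The counting step described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract does not say how text is segmented or what counts as "frequent", so the sketch assumes that a segment is a non-empty line, that a segment is common when it appears in at least a `min_share` fraction of the documents, and that two common segments that are adjacent once the variable text is dropped form a candidate (start marker, end marker) pair. The names `segment`, `common_segments`, `landmark_candidates`, and `min_share` are hypothetical.

```python
from collections import Counter


def segment(doc: str) -> list[str]:
    """Split a document into segments (assumption: one segment per non-empty line)."""
    return [line.strip() for line in doc.splitlines() if line.strip()]


def common_segments(docs: list[str], min_share: float = 1.0) -> set[str]:
    """Return segments that occur in at least min_share of the documents."""
    counts = Counter()
    for doc in docs:
        # Count each segment once per document, not once per occurrence.
        counts.update(set(segment(doc)))
    threshold = min_share * len(docs)
    return {seg for seg, n in counts.items() if n >= threshold}


def landmark_candidates(
    docs: list[str], min_share: float = 1.0
) -> list[tuple[str, str]]:
    """Return candidate (start marker, end marker) pairs in first-seen order."""
    common = common_segments(docs, min_share)
    pair_counts: Counter = Counter()
    for doc in docs:
        # Drop document-specific text, keeping common segments in document order.
        markers = [s for s in segment(doc) if s in common]
        # dict.fromkeys de-duplicates pairs within a document and keeps their order.
        for pair in dict.fromkeys(zip(markers, markers[1:])):
            pair_counts[pair] += 1
    threshold = min_share * len(docs)
    return [pair for pair, n in pair_counts.items() if n >= threshold]


if __name__ == "__main__":
    docs = [
        "Name:\nAlice\nAddress:\n1 Main St\nEnd",
        "Name:\nBob\nAddress:\n2 Oak Ave\nEnd",
        "Name:\nCarol\nAddress:\n3 Pine Rd\nEnd",
    ]
    for start, end in landmark_candidates(docs):
        print(f"landmark: {start!r} ... {end!r}")
```

In this toy input the repeated labels become markers, and the text that differs between documents (names, addresses) is what falls inside each landmark. Extracting styling information and mapping it to merge rules is outside the scope of this sketch.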