
METHOD TO IDENTIFY COMMON STRUCTURES IN FORMATTED TEXT DOCUMENTS



Abstract

A computer-implemented method, computer program product, and data processing system identify common structures shared across a plurality of formatted text documents. The common structure is presented as a sequence of landmarks, each of which has a starting and an ending marker that describe the borders of a span of text. The common structure is identified by counting the occurrences of repeating text segments across documents. Adjacent segments that frequently co-occur become candidates for landmark markers. In addition, styling information for the textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which offers an advantage over the current practice of content concatenation.
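The counting step described in the abstract can be illustrated with a minimal sketch. This is not the patented procedure: the abstract does not say how segments are split or what counts as "frequent", so the tag-based tokenizer, the `min_support` threshold, and the names `split_segments` and `candidate_markers` are all assumptions made for illustration.

```python
import re
from collections import Counter

def split_segments(doc):
    """Split a formatted document into segments.

    Assumption: a segment is either a markup tag or a run of text
    between tags. The abstract does not define segmentation.
    """
    parts = re.findall(r"<[^>]+>|[^<]+", doc)
    return [p.strip() for p in parts if p.strip()]

def candidate_markers(docs, min_support=0.8):
    """Return adjacent segment pairs that occur in many documents.

    min_support is a hypothetical parameter: the fraction of documents
    in which a pair must appear to count as a marker candidate.
    """
    pair_doc_freq = Counter()
    for doc in docs:
        segs = split_segments(doc)
        # Use a set so each pair is counted at most once per document.
        pair_doc_freq.update(set(zip(segs, segs[1:])))
    threshold = min_support * len(docs)
    return sorted(pair for pair, n in pair_doc_freq.items() if n >= threshold)

# Three toy documents that share headings but differ in body text.
DOCS = [
    "<h1>Report</h1><h2>Summary</h2>Alpha text<h2>Details</h2>More alpha",
    "<h1>Report</h1><h2>Summary</h2>Beta text<h2>Details</h2>More beta",
    "<h1>Report</h1><h2>Summary</h2>Gamma text<h2>Details</h2>More gamma",
]

markers = candidate_markers(DOCS, min_support=1.0)
for first, second in markers:
    print(first, "|", second)
```

In this toy input, pairs built from the shared headings survive the threshold, while pairs that include document-specific body text do not. Turning the surviving pairs into start and end markers of landmarks, and extracting styling rules, are further steps that this sketch does not attempt.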


Bibliographic data

Publication number: US2011137900A1

Patent type:
Publication date: 2011-06-09

Original format: PDF
Applicant/Assignee: YUAN-CHI CHANG; DEBDOOT MUKHERJEE; VIBHA SINGHAL SINHA; BIPLAV SRIVASTAVA

Application number: US20090634176
Inventors: YUAN-CHI CHANG; DEBDOOT MUKHERJEE; VIBHA SINGHAL SINHA; BIPLAV SRIVASTAVA

Filing date: 2009-12-09
Classification: G06F17; G06F17/21; G06F17/30
Country: US
Date indexed: 2022-08-21 18:12:11
