
METHOD TO IDENTIFY COMMON STRUCTURES IN FORMATTED TEXT DOCUMENTS


Abstract

A computer-implemented method, computer program product, and data processing system for identifying common structures shared across a plurality of formatted text documents. The common structure is represented as a sequence of landmarks, each of which has a starting marker and an ending marker that delimit a span of text. The common structure is identified by counting the occurrences of repeating text segments across documents. Adjacent segments that frequently co-occur become candidate markers for landmarks. In addition, styling information for the textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which offers an advantage over the current practice of concatenating content.
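
The counting step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes each document has already been split into an ordered list of text segments, and the function name `candidate_markers` and the `min_fraction` threshold are hypothetical choices made for illustration only.

```python
from collections import Counter


def candidate_markers(documents, min_fraction=0.8):
    """Return adjacent segment pairs found in at least min_fraction of documents.

    `documents` is a list of documents, each given as an ordered list of
    text segments (an assumed pre-segmentation; the abstract does not
    specify how segments are obtained).
    """
    pair_doc_counts = Counter()
    for segments in documents:
        # Count each adjacent pair at most once per document, so the
        # tally reflects how many documents share the pair.
        pairs = set(zip(segments, segments[1:]))
        pair_doc_counts.update(pairs)

    threshold = min_fraction * len(documents)
    return sorted(pair for pair, n in pair_doc_counts.items() if n >= threshold)


if __name__ == "__main__":
    docs = [
        ["<html>", "<h1>", "Report A", "</h1>", "<p>", "alpha", "</p>"],
        ["<html>", "<h1>", "Report B", "</h1>", "<p>", "beta", "</p>"],
        ["<html>", "<h1>", "Report C", "</h1>", "<p>", "gamma", "</p>"],
    ]
    for pair in candidate_markers(docs):
        print(pair)
```

In this toy input, only the pairs of adjacent segments that appear unchanged in every document survive the threshold; the document-specific text between them is what a landmark would enclose. Pairing the surviving candidates into starting and ending markers, and extracting styling rules, are separate steps that this sketch does not attempt.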
