
Method and apparatus for detecting and summarizing document similarity within large document sets


Abstract

A method and apparatus are disclosed for comparing an input or query file to a set of files to detect similarities and for formatting the output comparison data. An input query file that can be segmented into multiple query file substrings is received. A query file substring is selected and used to search a storage area containing multiple ordered file substrings that were taken from previously analyzed files. If the selected query file substring matches any of the multiple ordered file substrings, match data relating to the match between the selected query file substring and the matching ordered file substring is stored in a temporary file. The matching ordered file substring and a second ordered file substring are joined if the matching ordered file substring and the second ordered file substring are in a particular sequence and if the selected query file substring and a second query file substring are in the same particular sequence. If the matching ordered file substring and the second query file substring match, a coalesced matching ordered substring and a coalesced query file substring are formed that can be used to format output comparison data.
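The steps in the abstract (index substrings of previously analyzed files, look up each query substring, record the matches, then join matches that are adjacent in both the file and the query) can be illustrated with a minimal Python sketch. This is not the patented implementation. Everything here is an assumption made for illustration: the fixed window length `k`, the in-memory dictionary standing in for the storage area, the plain list standing in for the temporary file, and the function names `build_index`, `find_matches`, and `coalesce`.

```python
from collections import defaultdict


def substrings(text, k):
    """Yield (position, substring) for every length-k window of text."""
    for i in range(len(text) - k + 1):
        yield i, text[i:i + k]


def build_index(files, k):
    """Map each length-k substring to the (file name, position) pairs where it occurs.

    Assumption: this dictionary stands in for the abstract's storage area
    of ordered file substrings.
    """
    index = defaultdict(list)
    for name, text in files.items():
        for pos, sub in substrings(text, k):
            index[sub].append((name, pos))
    return index


def find_matches(query, index, k):
    """Return (file name, file position, query position) for each matching window.

    Assumption: the returned list stands in for the abstract's temporary file.
    """
    matches = []
    for qpos, sub in substrings(query, k):
        for name, fpos in index.get(sub, []):
            matches.append((name, fpos, qpos))
    return matches


def coalesce(matches, k):
    """Join matches that are consecutive in both the file and the query.

    Returns (file name, file start, query start, length) tuples.
    """
    # Matches with the same offset (file position minus query position) in the
    # same file lie on one alignment; sort them by query position.
    ordered = sorted(matches, key=lambda m: (m[0], m[1] - m[2], m[2]))
    runs = []
    for name, fpos, qpos in ordered:
        if runs:
            rname, rfile, rquery, rlen = runs[-1]
            same_alignment = rname == name and rfile - rquery == fpos - qpos
            # The next window extends the run only if it starts one character
            # after the run's last window.
            if same_alignment and qpos == rquery + rlen - k + 1:
                runs[-1] = (rname, rfile, rquery, rlen + 1)
                continue
        runs.append((name, fpos, qpos, k))
    return runs


files = {"a.txt": "the quick brown fox", "b.txt": "lorem ipsum dolor"}
index = build_index(files, k=4)
matches = find_matches("a quick brown dog", index, k=4)
print(coalesce(matches, k=4))  # [('a.txt', 3, 1, 13)]
```

The ten overlapping four-character matches collapse into one coalesced region of 13 characters (`" quick brown "`), which is the kind of summarized result the abstract says can be used to format the output comparison data.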

Bibliographic data

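  • Publication number: US6240409B1

  • Patent type: (not listed)

  • Publication date: 2001-05-29

  • Original format: PDF

  • Applicant/assignee: THE REGENTS OF THE UNIVERSITY OF CALIFORNIA

  • Application number: US19980127105

  • Inventor: ALEXANDER AIKEN

  • Filing date: 1998-07-31

  • Classification: G06F173/00

  • Country: US

  • Database entry time: 2022-08-22 01:04:11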
