DOCUMENT SIMILARITY CALCULATION METHOD, AND METHOD AND DEVICE FOR DETECTING APPROXIMATELY DUPLICATE DOCUMENTS

Abstract

The present invention relates to a document similarity calculation method, and to a method and device for detecting approximately duplicate documents. The calculation method comprises: performing word segmentation on each of the two documents to be compared, to obtain a word set for each document; calculating the edit similarity of every word pair across the two word sets, where the two words of each pair come from different sets; creating an edge for each word pair whose edit similarity satisfies the requirement, with the edit similarity as the weight of that edge, thereby obtaining a weighted bipartite graph; calculating the maximum weighted matching value of the weighted bipartite graph; and using the maximum weighted matching value to calculate the similarity between the two documents. According to the abstract, the calculation method and the detection method and device achieve a high accuracy rate and can effectively identify approximately duplicate documents whose words contain editing errors, which improves detection accuracy for approximately duplicate documents while reducing computational complexity and improving computational efficiency.
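The steps in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the abstract does not define the edit similarity, the edge requirement, or the final similarity formula, so the sketch assumes a length-normalized Levenshtein similarity, a hypothetical threshold of 0.5, and a Dice-style normalization `2 * M / (|A| + |B|)`. Whitespace splitting stands in for real word segmentation, and the matching is computed with a standard Hungarian (Kuhn-Munkres) routine.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def max_weight_matching(weights):
    """Maximum total weight of a bipartite matching (Hungarian algorithm).

    `weights` is an n x m matrix with n <= m; a weight of 0.0 means "no edge".
    """
    n, m = len(weights), len(weights[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    # Minimize negated weights to maximize total weight.
                    cur = -weights[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(weights[p[j] - 1][j - 1] for j in range(1, m + 1) if p[j])


def document_similarity(doc_a: str, doc_b: str, threshold: float = 0.5) -> float:
    """Similarity of two documents; threshold and normalization are assumptions."""
    a, b = sorted(set(doc_a.split())), sorted(set(doc_b.split()))
    if not a or not b:
        return 0.0
    if len(a) > len(b):  # the matching routine expects rows <= columns
        a, b = b, a
    # Keep an edge only when the edit similarity meets the threshold.
    weights = [
        [s if (s := edit_similarity(x, y)) >= threshold else 0.0 for y in b]
        for x in a
    ]
    return 2.0 * max_weight_matching(weights) / (len(a) + len(b))


if __name__ == "__main__":
    print(document_similarity("the quick brown fox", "the quikc brown fox"))
```

Because a misspelled word such as "quikc" still gets a weighted edge to "quick", the pair contributes to the matching instead of counting as a complete mismatch, which is the behavior the abstract describes for documents with editing errors. For larger inputs, a library routine for the assignment problem could replace the hand-written matching step.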