DOCUMENT SIMILARITY CALCULATION METHOD, AND METHOD AND DEVICE FOR DETECTING APPROXIMATELY DUPLICATE DOCUMENTS
Abstract
The present invention relates to a document similarity calculation method, and to a method and device for detecting approximately duplicate documents. The calculation method comprises: performing word segmentation on each of the two documents to be compared, to obtain a set of segmented words for each document; calculating the edit similarity of every word pair across the two sets, where the two words of each pair come from different sets; establishing an edge for each word pair whose edit similarity satisfies the requirement, with the edit similarity serving as the weight of that edge, thereby obtaining a weighted bipartite graph; calculating the maximum weighted matching value of the weighted bipartite graph; and using the maximum weighted matching value to calculate the similarity between the two documents. The document similarity calculation method, and the method and device for detecting approximately duplicate documents provided by the present invention, achieve a high accuracy rate and can effectively identify approximately duplicate documents whose segmented words contain editing errors, thereby improving the detection accuracy for approximately duplicate documents, reducing computational complexity and improving computational efficiency.
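The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under several assumptions that the abstract does not specify: whitespace splitting stands in for a real word segmenter, edit similarity is taken to be 1 minus the Levenshtein distance divided by the longer word's length, the edge requirement is a hypothetical threshold of 0.5, and the final similarity is the matching value divided by the size of the larger word set. The matching step uses a standard Hungarian-style assignment routine, which is one way to obtain a maximum weighted matching and is not necessarily the procedure used in the patent. All function names are hypothetical.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def edit_similarity(s, t):
    """Assumed definition: 1 - distance / length of the longer word."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def max_weight_matching(weights):
    """Maximum total weight of a bipartite matching; a weight of 0 means 'no edge'."""
    if not weights or not weights[0]:
        return 0.0
    if len(weights) > len(weights[0]):          # ensure rows <= columns
        weights = [list(col) for col in zip(*weights)]
    n, m = len(weights), len(weights[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)     # row / column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)       # p[j] = row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -weights[i0 - 1][j - 1] - u[i0] - v[j]  # cost = -weight
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                                # augment along the found path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(weights[p[j] - 1][j - 1] for j in range(1, m + 1) if p[j])


def document_similarity(doc_a, doc_b, threshold=0.5):
    """Similarity in [0, 1]; the threshold and normalisation are assumptions."""
    a = sorted(set(doc_a.split()))               # stand-in for word segmentation
    b = sorted(set(doc_b.split()))
    if not a or not b:
        return 1.0 if a == b else 0.0
    weights = []
    for wa in a:
        row = [edit_similarity(wa, wb) for wb in b]
        weights.append([s if s >= threshold else 0.0 for s in row])
    return max_weight_matching(weights) / max(len(a), len(b))


if __name__ == "__main__":
    print(document_similarity("the quick brown fox", "the quikc brown fox"))
```

Under these assumptions, a pair of documents would be flagged as approximately duplicate when `document_similarity` exceeds a chosen cut-off. Because misspelled words such as `quikc` still receive a partial edge weight, the score degrades gradually instead of dropping as it would under exact word matching.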