CLUSTERING OF NEAR-DUPLICATE DOCUMENTS

Abstract

Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
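
The two-stage process described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the method claimed in the patent: the abstract does not say how document vectors are built or how edit distance is computed from them, so the sketch assumes equal-length integer vectors and a Hamming-style distance (the count of differing components). The names `edit_distance`, `Cluster`, `initial_clusters`, `merge_clusters`, `d1`, and `d2` are hypothetical. The merge test is one conservative way to check the second constraint from cluster structure alone: each cluster keeps a root and a radius, and the triangle inequality bounds the distance between any two members by twice the radius.

```python
from dataclasses import dataclass, field


def edit_distance(u, v):
    """Assumed distance: number of vector components that differ."""
    assert len(u) == len(v), "document vectors must have equal length"
    return sum(a != b for a, b in zip(u, v))


@dataclass
class Cluster:
    root: tuple                      # document vector of the root document
    members: list = field(default_factory=list)
    radius: int = 0                  # upper bound on root-to-member distance


def initial_clusters(vectors, d1):
    """First constraint: a document joins the first root within d1 of it."""
    clusters = []
    for v in vectors:
        for c in clusters:
            dist = edit_distance(c.root, v)
            if dist <= d1:
                c.members.append(v)
                c.radius = max(c.radius, dist)
                break
        else:
            # No existing root is close enough, so v starts a new cluster.
            clusters.append(Cluster(root=v, members=[v]))
    return clusters


def merge_clusters(clusters, d2):
    """Second constraint: merge only if every pair provably stays within d2.

    Only roots and radii are compared. After a merge every member lies
    within new_radius of the surviving root, so by the triangle inequality
    any two members are at most 2 * new_radius apart.
    """
    merged = []
    for c in clusters:
        for m in merged:
            new_radius = max(m.radius, edit_distance(m.root, c.root) + c.radius)
            if 2 * new_radius <= d2:
                m.members.extend(c.members)
                m.radius = new_radius
                break
        else:
            merged.append(Cluster(c.root, list(c.members), c.radius))
    return merged
```

Because the test is a sufficient condition rather than an exact one, it may decline some merges that a pairwise comparison of all documents would allow. In exchange, each merge decision needs a single root-to-root distance instead of one comparison per document pair.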

Bibliographic details

Publication number: US2011087668A1

Publication date: 2011-04-14

Original format: PDF
Applicant/assignee: JOY THOMAS; SAURAJ GOSWAMI; VAMSI SALAKA
Application number: US20100870733
Inventors: JOY THOMAS; SAURAJ GOSWAMI; VAMSI SALAKA
Filing date: 2010-08-27
Classification: G06F17/30
Country: US
Added to database: 2022-08-21 18:15:12
