CLUSTERING OF NEAR-DUPLICATE DOCUMENTS

Abstract

Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined by comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can then be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined so that its satisfaction can be checked by comparing cluster structures rather than individual documents.
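
The abstract does not say how the document vectors or the edit distance are built, so the following is only a minimal sketch of the two-stage process under stated assumptions, not the patented method. It assumes that words are hashed into a fixed number of buckets (`DIM`) to get a low-dimensional count vector, that the L1 distance between count vectors stands in for the edit distance, and that each cluster structure keeps a root vector, a radius (maximum distance from the root), and an upper bound on its diameter. All names (`doc_vector`, `Cluster`, `d1`, `d2`) are hypothetical.

```python
import zlib
from dataclasses import dataclass, field

DIM = 64  # assumed size of the low-dimensional space


def doc_vector(text, dim=DIM):
    """Hash each word into one of `dim` buckets and count occurrences (assumption)."""
    vec = [0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1
    return vec


def distance(u, v):
    """L1 distance between count vectors, used here as a stand-in for edit distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


@dataclass
class Cluster:
    root: list                                   # vector of the root document
    members: list = field(default_factory=list)  # indices of member documents
    radius: int = 0                              # max distance from root to a member
    diameter: int = 0                            # upper bound on any pairwise distance


def initial_clusters(vectors, d1):
    """Stage 1: a document joins the first cluster whose root is within d1."""
    clusters = []
    for i, vec in enumerate(vectors):
        for c in clusters:
            dist = distance(c.root, vec)
            if dist <= d1:
                c.members.append(i)
                c.radius = max(c.radius, dist)
                c.diameter = 2 * c.radius  # triangle-inequality bound
                break
        else:
            clusters.append(Cluster(root=vec, members=[i]))
    return clusters


def merge_clusters(clusters, d2):
    """Stage 2: merge clusters while a bound on the pairwise distance stays <= d2.

    Only roots, radii and diameter bounds are compared, never individual documents.
    """
    merged = []
    for c in clusters:
        for m in merged:
            gap = distance(m.root, c.root)
            cross = m.radius + gap + c.radius  # bound for pairs spanning both clusters
            bound = max(m.diameter, c.diameter, cross)
            if bound <= d2:
                m.members.extend(c.members)
                m.radius = max(m.radius, gap + c.radius)
                m.diameter = bound
                break
        else:
            merged.append(Cluster(c.root, list(c.members), c.radius, c.diameter))
    return merged
```

A typical call would be `merge_clusters(initial_clusters([doc_vector(t) for t in texts], d1), d2)`. Because the L1 distance obeys the triangle inequality, the bound computed from cluster structures never underestimates the true maximum pairwise distance, so the check is conservative: it may refuse a merge that a document-by-document comparison would allow.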
