Conference proceedings: New frontiers in applied data mining

A Fusion of Algorithms in Near Duplicate Document Detection

Abstract

With the rapid development of the World Wide Web, the Internet now contains a huge number of fully or partially duplicated pages. Returning these near-duplicate results to users greatly degrades the user experience. In deploying digital libraries, the protection of intellectual property and the removal of duplicate content also need to be considered. This paper fuses several state-of-the-art algorithms to achieve better performance. We first introduce the three major algorithms in duplicate document detection (shingling, I-Match, and simhash) and their subsequent developments. We take sequences of words (shingles) as the features of the simhash algorithm. We then incorporate the random-lexicon-based multi-fingerprint generation method into the shingling-based simhash algorithm and name the result the shingling-based multi-fingerprint simhash algorithm. We ran preliminary experiments on a synthetic dataset based on the "China-US Million Book Digital Library Project". The experimental results demonstrate the efficiency of these algorithms.
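The idea of using word shingles as simhash features can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`shingles`, `simhash`, `hamming_distance`), the shingle length `k`, the 64-bit fingerprint size, the uniform feature weights, and the use of SHA-256 as the per-shingle hash are all assumptions made for illustration. The random-lexicon multi-fingerprint step is not shown.

```python
import hashlib


def shingles(text, k=3):
    """Return the k-word shingles (contiguous word sequences) of text."""
    words = text.lower().split()
    if not words:
        return []
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return [" ".join(words)]
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text, k=3, bits=64):
    """Compute a simhash fingerprint using k-word shingles as features.

    Every shingle occurrence gets weight 1 (an assumption of this sketch).
    """
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.sha256(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Vote +1 if bit i of the shingle hash is set, otherwise -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Under these assumptions, two documents would be flagged as near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold value is application-specific and is not given in the abstract.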

Bibliographic record

  • Source: New frontiers in applied data mining
  • Conference location: Shenzhen (CN)
  • Authors

    Jun Fan; Tiejun Huang

  • Author affiliations

    National Engineering Laboratory for Video Technology, School of EE & CS, Peking University, Beijing 100871, China; Peking University Shenzhen Graduate School, Shenzhen 518055, China

    National Engineering Laboratory for Video Technology, School of EE & CS, Peking University, Beijing 100871, China; Peking University Shenzhen Graduate School, Shenzhen 518055, China

  • Conference organizer
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification: TP311.13
  • Keywords

    duplicate document detection; digital library; web pages; near duplicate document
