Journal: International Journal of Data Mining & Knowledge Management Process
A Near-Duplicate Detection Algorithm to Facilitate Document Clustering

Abstract

Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data such as the Internet. The presence of these web pages plays an important role in performance degradation when integrating data from heterogeneous sources. These pages increase either the index storage space or the serving costs. Detecting these pages has many potential applications; for example, near-duplicates may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents in order to facilitate document clustering. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm performs better in terms of similarity measures. Identifying duplicate and near-duplicate documents has resulted in reduced memory usage in repositories.