A Near-Duplicate Detection Algorithm to Facilitate Document Clustering

Lavanya Pamulaparty; Dr. C.V Guru Rao; Dr. M. Sreenivasa Rao


Abstract

Web mining faces major problems caused by duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data such as the Internet. The presence of these web pages contributes significantly to performance degradation when integrating data from heterogeneous sources. These pages increase either the index storage space or the serving costs. Detecting them has many potential applications; for example, a near-duplicate may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents that are used in document clustering. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm performs better in terms of similarity measures. Identifying duplicate and near-duplicate documents reduced the memory required by repositories.

Bibliographic record

Source
International Journal of Data Mining & Knowledge Management Process | 2014, Issue 6 | 1 page
Authors
Lavanya Pamulaparty; Dr. C.V Guru Rao; Dr. M. Sreenivasa Rao
Original format
PDF
Chinese Library Classification
Computing technology, computer technology

Similar literature

1. Fingerprint-Based Near-Duplicate Document Detection with Applications to SNS Spam Detection [J]. Phuc-Tran Ho, Sung-Ryul Kim. International Journal of Distributed Sensor Networks. 2014, Issue 3.
2. Detecting Near-duplicates in Russian Documents through Using Fingerprint Algorithm Simhash [J]. N. Rezaeian, G.M. Novikova. Procedia Computer Science. 2017, Issue 1.
3. Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL [J]. Ercan Canhasi. Journal of computer sciences. 2018, Issue 5.
4. Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages [C]. Wei Yongzhuang, Wang Shuai, Yuan Chunfeng. International Conference on Parallel and Distributed Computing, Applications and Technologies. 2012.
5. Comparison of clustering algorithms and its application to document clustering [D]. Chen, Jie. 2005.
6. Swarm Intelligence Algorithms in Text Document Clustering with Various Benchmarks [O]. Suganya Selvaraj, Eunmi Choi. 2021.
7. XNDDF: Towards a Framework for Flexible Near-Duplicate Document Detection Using Supervised and Unsupervised Learning [O]. Pamulaparty Lavanya, Guru Rao C.V., Rao M. Sreenivasa. 2015.
