Web mining faces major problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data such as the Internet. The presence of these pages contributes significantly to performance degradation when integrating data from heterogeneous sources. These pages increase either the index storage space or the serving costs. Detecting them has many potential applications; for example, near duplicates may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents that are used to perform document clustering. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm achieves better results in terms of similarity measures. Identifying duplicate and near-duplicate documents also reduced memory usage in repositories.