Clustering helps organise large collections of data items by grouping similar items into meaningful clusters. A successful clustering approach depends on an effective similarity search algorithm. For text documents, the similarity search problem can be recast as a problem on sets by using the method called “shingling”. The characteristic matrix of the sets is built by searching for each shingle in each document, so the cost of building the matrix is high when the dataset is very large. This search time can be reduced substantially if the characteristic matrix is built with a Bloom filter, which answers each membership query in constant time. Finding the similarity among all pairs of sets is a major issue, since comparing n sets pairwise takes O(n²) time. Locality-sensitive Hashing (LSH) greatly reduces this cost by generating candidate pairs, focusing the similarity search on the pairs that are most likely to be similar. In this paper, a scheme for clustering Bengali news articles based on similarity search with Locality-sensitive Hashing is presented.
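As an illustration of the pipeline described above, the following is a minimal sketch, not the paper's implementation. It assumes character-level shingles, MinHash signatures, and the common banding form of LSH; the abstract does not specify which LSH variant or shingle type the authors use. The Bloom filter step is omitted here and plain Python sets stand in for the characteristic matrix. All function names and parameter values (`k`, `bands`, `rows`, `threshold`) are hypothetical choices for the example.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = 2**61 - 1  # a Mersenne prime, used as the modulus of the hash family


def shingles(text, k=5):
    """Return the set of character k-shingles of text (whitespace normalised)."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if a or b else 1.0


def minhash_signature(shingle_set, hash_params):
    """One minimum per hash function h(x) = (a*x + b) mod PRIME."""
    # crc32 over UTF-8 bytes gives a stable integer id, so Bengali text works too.
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min(((a * x + b) % PRIME for x in ids), default=PRIME)
            for a, b in hash_params]


def lsh_candidate_pairs(signatures, bands, rows):
    """Documents whose signatures agree on every row of some band become candidates."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def similar_pairs(docs, k=5, bands=20, rows=5, threshold=0.5, seed=0):
    """Return {(id_a, id_b): similarity} for candidate pairs above the threshold."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(bands * rows)]
    sets = {d: shingles(t, k) for d, t in docs.items()}
    sigs = {d: minhash_signature(s, params) for d, s in sets.items()}
    result = {}
    # Only candidate pairs are compared exactly, instead of all O(n^2) pairs.
    for a, b in lsh_candidate_pairs(sigs, bands, rows):
        sim = jaccard(sets[a], sets[b])
        if sim >= threshold:
            result[(a, b)] = sim
    return result
```

Calling `similar_pairs({"a": text_a, "b": text_b, ...})` returns only the pairs that share at least one LSH bucket and pass the exact Jaccard check. Increasing `rows` makes the candidate filter stricter, while increasing `bands` makes it more permissive; the values above are placeholders rather than tuned settings.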