The problem of identifying approximate similarity between pair of strings is an essential step for data cleansing and data integration process. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity potential duplicate. But existing system does not produce the similarity percentage between pair of strings. In this paper we propose a method using entropy and information gain (IG) to find dissimilarity between pair of strings to increase the accuracy of data.
展开▼