The Web is a commonly used medium for searching information, and search engines rely on Web crawlers to gather it. A crawler fetches pages related to a given keyword, but some of these pages contain duplicate content; different URLs with similar text are known as DUST. To improve search-engine performance, the DUSTER method is used: it detects and removes duplicate URLs without fetching their contents. A single crawler crawls one URL at a time, whereas parallel crawlers crawl multiple URLs simultaneously; the combined results of the parallel crawlers are given as input to DUSTER. Multiple sequence alignment is used to generate candidate rules and validation rules. The candidate rules are then filtered according to their performance on a validation set, and the remaining rules are applied to remove duplicate URLs. With this method, a large reduction in the number of duplicate URLs is achieved.
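The pipeline described above (generate candidate URL rewrite rules, keep only the rules that perform well on a validation set of known-duplicate URL pairs, then normalize URLs and discard duplicates without fetching content) can be sketched roughly as follows. This is only an illustrative sketch: the simple string-replacement rule format, the sample URLs, and the acceptance threshold are assumptions for demonstration, not the actual DUSTER rule representation.

```python
def apply_rule(url, rule):
    """Apply a (pattern, replacement) rewrite rule to a URL.
    Hypothetical rule format: plain substring replacement."""
    pattern, replacement = rule
    return url.replace(pattern, replacement)

def validate_rule(rule, validation_pairs, threshold=0.5):
    """Keep a candidate rule only if it maps a sufficient fraction of
    known-duplicate URL pairs in the validation set to the same form."""
    if not validation_pairs:
        return False
    hits = sum(1 for a, b in validation_pairs
               if apply_rule(a, rule) == apply_rule(b, rule))
    return hits / len(validation_pairs) >= threshold

def deduplicate(urls, rules):
    """Normalize every URL with the validated rules and keep one URL
    per canonical form -- no page content is ever fetched."""
    seen, kept = set(), []
    for url in urls:
        canonical = url
        for rule in rules:
            canonical = apply_rule(canonical, rule)
        if canonical not in seen:
            seen.add(canonical)
            kept.append(url)
    return kept

# Hypothetical candidate rules, e.g. "index.html is optional" and
# "the session-id parameter is irrelevant".
candidate_rules = [("/index.html", "/"), ("?sessionid=123", "")]

# Validation set: pairs of URLs known to point to the same page.
validation = [
    ("http://a.com/index.html", "http://a.com/"),
    ("http://a.com/x?sessionid=123", "http://a.com/x"),
]

validated = [r for r in candidate_rules if validate_rule(r, validation)]
urls = ["http://a.com/", "http://a.com/index.html", "http://a.com/x"]
print(deduplicate(urls, validated))
# → ['http://a.com/', 'http://a.com/x']
```

In this sketch the crawler outputs (here, the `urls` list) would come from merging the results of the parallel crawlers before being passed to the deduplication step.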