The use of open bug tracking repositories like Bugzilla is common in many software applications. They allow developers, testers and users the ability to report problems associated with the system and track resolution status. Open and democratic reporting tools, however, face one major challenge: users can, and often do, submit reports describing the same problem. Research in duplicate report detection has primarily focused on word frequency based similarity measures paying little regard to the context or structure of the reporting language. Thus, in large repositories, reports describing different issues may be marked as duplicates due to the frequent use of common words. In this paper, we present Factor LCS, a methodology which utilizes common sequence matching for duplicate report detection. We demonstrate the approach by analyzing the complete Fire fox bug repository up until March 2012 as well as a smaller subset of Eclipse dataset from January 1, 2008 to December 31, 2008. We achieve a duplicate recall rate above 70% with Fire fox, which exceeds the results reported on smaller subsets of the same repository.
展开▼