Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, and molecular biology, as well as many others. In most of these areas, the data are originally collected at different sites. In order to extract information from these data, they are merged at a central site and then clustered. In this paper, we propose a different approach. We cluster the data locally and extract suitable representatives from these clusters. These representatives are sent to a global server site, where we restore the complete clustering based on the local representatives. This approach is very efficient, because the local clusterings can be carried out quickly and independently of each other. Furthermore, we have low transmission cost, as the number of transmitted representatives is much smaller than the cardinality of the complete data set. Based on this small number of representatives, the global clustering can be done very efficiently. For both the local and the global clustering, we use a density-based clustering algorithm. The combination of the local and the global clustering forms our new DBDC (Density Based Distributed Clustering) algorithm. Furthermore, we discuss the complex problem of finding a suitable quality measure for evaluating distributed clusterings. We introduce two quality criteria, which are compared to each other and which allow us to evaluate the quality of our DBDC algorithm. In our experimental evaluation, we show that we do not have to sacrifice clustering quality in order to gain an efficiency advantage when using our distributed clustering approach.
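The local-then-global scheme described above can be illustrated with a minimal sketch. It is not the DBDC implementation itself: it assumes a plain DBSCAN-style density-based clusterer, uses cluster centroids as the transmitted representatives, and merges representatives at the server with a minimum neighborhood size of 1. The function names (`dbscan`, `local_representatives`, `global_clustering`) and all parameter values are hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN-style clustering; returns one label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = [k for k in range(len(points))
                  if dist(points[j], points[k]) <= eps]
            if len(nj) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nj)
    return labels


def local_representatives(points, eps, min_pts):
    """Cluster one site's data and return one representative per local cluster.

    Assumption: the representative is the cluster centroid. The paper only
    says "suitable representatives"; this choice is a simplification.
    """
    labels = dbscan(points, eps, min_pts)
    reps = []
    for c in sorted(set(labels) - {-1}):
        members = [p for p, label in zip(points, labels) if label == c]
        reps.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
    return reps


def global_clustering(site_reps, eps_global):
    """Server side: cluster the representatives received from all sites.

    Assumption: min_pts=1, so every representative is kept and two
    representatives are merged when they lie within eps_global of each other.
    """
    all_reps = [r for reps in site_reps for r in reps]
    return all_reps, dbscan(all_reps, eps_global, min_pts=1)


# Two sites, each holding its own data; only representatives are transmitted.
site_a = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
site_b = [(2, 0), (2, 1), (3, 0), (3, 1), (20, 20), (20, 21), (21, 20), (21, 21)]

reps = [local_representatives(site, eps=1.5, min_pts=3) for site in (site_a, site_b)]
all_reps, global_labels = global_clustering(reps, eps_global=3.0)
for rep, label in zip(all_reps, global_labels):
    print(rep, "-> global cluster", label)
```

In this toy setting, 16 points are reduced to 4 transmitted representatives, and the server merges the two local clusters that straddle the site boundary. Which representatives to send, and how to choose the global neighborhood radius, are the design decisions that determine how closely the distributed result matches a central clustering.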