Crowdsourcing utilizes human ability by distributing tasks to a large number of workers. It is especially suitable for solving data clustering problems because it provides a way to obtain a similarity measure between objects based on manual annotations, which capture the human perception of similarity among objects. This is in contrast to most clustering algorithms that face the challenge of finding an appropriate similarity measure for the given dataset. Several algorithms have been developed for crowdclustering that combine partial clustering results, each obtained by annotations provided by a different worker, into a single data partition. However, existing crowdclustering approaches require a large number of annotations, due to the noisy nature of human annotations, leading to a high computational cost in addition to the large cost associated with annotation. We address this problem by developing a novel approach for crowclustering that exploits the technique of matrix completion. Instead of using all the annotations, the proposed algorithm constructs a partially observed similarity matrix based on a subset of pairwise annotation labels that are agreed upon by most annotators. It then deploys the matrix completion algorithm to complete the similarity matrix and obtains the final data partition by applying a spectral clustering algorithm to the completed similarity matrix. We show, both theoretically and empirically, that the proposed approach needs only a small number of manual annotations to obtain an accurate data partition. In effect, we highlight the trade-off between a large number of noisy crowdsourced labels and a small number of high quality labels.
展开▼