Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.