Designing learning methods that can handle large volumes of unstructured data, such as text documents, is a fundamental problem in the document analysis and recognition community. In this work, we propose FlexRank, a bipartite ranking algorithm designed specifically for text documents and based on lexicographical ordering. FlexRank builds on the area under the ROC curve (ROC AUC), a well-known metric for evaluating ranking and classification algorithms and for selecting features in text classification. We derive an expression for the exact increment in ROC AUC contributed by each attribute inserted into a lexicographic model. Using this expression, FlexRank performs an internal, AUC-driven feature selection to define its lexicographic ranker, which can then sort instances in linear time with most significant digit (MSD) radix sort. We empirically evaluated FlexRank on a range of text datasets and compared its speed and ROC AUC with those of support vector machines, decision trees, naive Bayes, k-nearest neighbours, and LexRank. FlexRank was substantially faster than all the other methods while retaining competitive ROC AUC performance.
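To make the idea of a lexicographic ranker concrete, the following is a minimal sketch and not the FlexRank implementation itself. It makes several assumptions that the abstract does not state: documents are sets of terms, the presence of a selected term always ranks a document higher, and the AUC gain of a candidate term is obtained by recomputing ROC AUC from scratch rather than through the exact increment formula mentioned above. All function names (`lex_key`, `roc_auc`, `greedy_select`, `msd_radix_rank`) and the toy data are hypothetical.

```python
def lex_key(doc, features):
    """Binary key of a document: one digit per feature, most significant first."""
    return tuple(1 if f in doc else 0 for f in features)


def roc_auc(keys, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Brute force over all pairs, for clarity only.
    """
    pos = [k for k, y in zip(keys, labels) if y == 1]
    neg = [k for k, y in zip(keys, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def greedy_select(docs, labels, vocabulary, max_features):
    """Greedily append the term that raises ROC AUC the most (assumed strategy)."""
    selected, best = [], 0.5  # an empty model ties every pair, so AUC = 0.5
    for _ in range(max_features):
        candidates = [f for f in vocabulary if f not in selected]
        if not candidates:
            break
        scored = [
            (roc_auc([lex_key(d, selected + [f]) for d in docs], labels), f)
            for f in candidates
        ]
        auc, term = max(scored, key=lambda t: t[0])
        if auc <= best:  # stop when no term improves the ranking
            break
        selected.append(term)
        best = auc
    return selected, best


def msd_radix_rank(indices, docs, features, depth=0):
    """Rank document indices with a binary MSD radix sort.

    Each level splits the current bucket on one feature in a single pass, so
    the cost is O(n * k) for n documents and k selected features.
    """
    if depth == len(features) or len(indices) <= 1:
        return list(indices)
    f = features[depth]
    ones = [i for i in indices if f in docs[i]]
    zeros = [i for i in indices if f not in docs[i]]
    return (msd_radix_rank(ones, docs, features, depth + 1)
            + msd_radix_rank(zeros, docs, features, depth + 1))


# Toy usage: two "sports" documents (label 1) and two "politics" documents (label 0).
docs = [{"goal", "team"}, {"goal", "match"}, {"goal", "vote"}, {"vote", "law"}]
labels = [1, 1, 0, 0]
vocabulary = ["goal", "team", "match", "vote", "law"]

selected, auc = greedy_select(docs, labels, vocabulary, max_features=5)
ranking = msd_radix_rank(range(len(docs)), docs, selected)
```

In this sketch a later feature only breaks ties left by earlier ones, which is why selection can stop as soon as no remaining term improves the AUC. A fuller version would presumably also handle terms whose presence should lower a document's rank.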