Machine learning provides tools for automated construction of predictivemodels in data intensive areas of engineering and science. The family ofregularized kernel methods have in the recent years become one of the mainstreamapproaches to machine learning, due to a number of advantages themethods share. The approach provides theoretically well-founded solutionsto the problems of under- and overfitting, allows learning from structureddata, and has been empirically demonstrated to yield high predictive performanceon a wide range of application domains. Historically, the problemsof classification and regression have gained the majority of attention in thefield. In this thesis we focus on another type of learning problem, that oflearning to rank.In learning to rank, the aim is from a set of past observations to learna ranking function that can order new objects according to how well theymatch some underlying criterion of goodness. As an important special caseof the setting, we can recover the bipartite ranking problem, correspondingto maximizing the area under the ROC curve (AUC) in binary classification.Ranking applications appear in a large variety of settings, examplesencountered in this thesis include document retrieval in web search, recommendersystems, information extraction and automated parsing of naturallanguage. We consider the pairwise approach to learning to rank, whereranking models are learned by minimizing the expected probability of rankingany two randomly drawn test examples incorrectly. The developmentof computationally efficient kernel methods, based on this approach, has inthe past proven to be challenging. Moreover, it is not clear what techniquesfor estimating the predictive performance of learned models are the mostreliable in the ranking setting, and how the techniques can be implementedefficiently.The contributions of this thesis are as follows. First, we developRankRLS, a computationally efficient kernel method for learning to rank,that is based on minimizing a regularized pairwise least-squares loss. Inaddition to training methods, we introduce a variety of algorithms for taskssuch as model selection, multi-output learning, and cross-validation, basedon computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm,which is one of the most well established methods for learning torank. Third, we study the combination of the empirical kernel map and reducedset approximation, which allows the large-scale training of kernel machinesusing linear solvers, and propose computationally efficient solutionsto cross-validation when using the approach. Next, we explore the problemof reliable cross-validation when using AUC as a performance criterion,through an extensive simulation study. We demonstrate that the proposedleave-pair-out cross-validation approach leads to more reliable performanceestimation than commonly used alternative approaches. Finally, we presenta case study on applying machine learning to information extraction frombiomedical literature, which combines several of the approaches consideredin the thesis. The thesis is divided into two parts. Part I provides the backgroundfor the research work and summarizes the most central results, PartII consists of the five original research articles that are the main contributionof this thesis.
展开▼