The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. Super Learner (SL) is a generic ensemble learning algorithm that uses cross-validation to select among a "library" of candidate prediction models. The SL is not restricted to a single prediction model, but uses the strengths of a variety of learning algorithms to adapt to different databases. While the SL has been shown to perform well in a number of settings, it has not been thoroughly evaluated in large electronic healthcare databases that are common in pharmacoepidemiology and comparative effectiveness research. In this study, we applied and evaluated the performance of the SL in its ability to predict treatment assignment using three electronic healthcare databases. We considered a library of algorithms that consisted of both nonparametric and parametric models. We also considered a novel strategy for prediction modeling that combines the SL with the high-dimensional propensity score (hdPS) variable selection algorithm. Predictive performance was assessed using three metrics: the negative log-likelihood, area under the curve (AUC), and time complexity. Results showed that the best individual algorithm, in terms of predictive performance, varied across datasets. The SL was able to adapt to the given dataset and optimize predictive performance relative to any individual learner. Combining the SL with the hdPS was the most consistent prediction method and may be promising for PS estimation and prediction modeling in electronic healthcare databases.
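As a rough illustration of the cross-validated selection step described above, the following minimal sketch implements a discrete SL (a selector that picks the single candidate with the lowest cross-validated negative log-likelihood) rather than a full weighted ensemble. The two candidate learners, the simulated data, and all function names are hypothetical and chosen for illustration only; they are not the library or the databases used in the study.

```python
import math
import random


def fit_marginal(X, y):
    """Intercept-only learner: always predicts the overall treatment prevalence."""
    p = sum(y) / len(y)
    return lambda x: p


def fit_stratified(X, y):
    """Predicts the smoothed treatment prevalence within levels of the first covariate."""
    counts = {}
    for x, t in zip(X, y):
        n, s = counts.get(x[0], (0, 0))
        counts[x[0]] = (n + 1, s + t)
    overall = sum(y) / len(y)
    # Add-one smoothing keeps predicted probabilities away from 0 and 1.
    probs = {level: (s + 1) / (n + 2) for level, (n, s) in counts.items()}
    return lambda x: probs.get(x[0], overall)


def neg_log_lik(y, p, eps=1e-12):
    """Mean Bernoulli negative log-likelihood of predicted probabilities p."""
    total = 0.0
    for t, q in zip(y, p):
        q = min(max(q, eps), 1.0 - eps)
        total -= t * math.log(q) + (1 - t) * math.log(1.0 - q)
    return total / len(y)


def cv_risk(fit, X, y, k=5):
    """k-fold cross-validated negative log-likelihood for one candidate learner."""
    n = len(y)
    total = 0.0
    for fold in (list(range(i, n, k)) for i in range(k)):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [model(X[i]) for i in fold]
        total += neg_log_lik([y[i] for i in fold], preds) * len(fold)
    return total / n


def discrete_super_learner(library, X, y, k=5):
    """Return the name of the lowest-risk candidate, all risks, and the refit winner."""
    risks = {name: cv_risk(fit, X, y, k) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks, library[best](X, y)


# Simulated data (an assumption): treatment depends strongly on one binary covariate.
random.seed(0)
X = [(random.randint(0, 1),) for _ in range(400)]
y = [1 if random.random() < (0.8 if x[0] else 0.2) else 0 for x in X]

library = {"marginal": fit_marginal, "stratified": fit_stratified}
best, risks, model = discrete_super_learner(library, X, y)
print(best, risks)
```

In this toy setting the selector should favor the stratified learner, because treatment assignment depends on the covariate. A full SL would instead estimate a weighted combination of the candidates' cross-validated predictions, and AUC could be computed on the same held-out predictions.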