Stacked Ensemble Learning for Propensity Score Methods in Observational Studies



Published Jun 30, 2021
Maximilian Autenrieth, Richard A. Levine, Juanjuan Fan, Maureen A. Guarcello


Propensity score methods account for selection bias in observational studies. However, the consistency of propensity score estimators depends strongly on correct specification of the propensity score model. Logistic regression and, increasingly, machine learning tools are used to estimate propensity scores. We introduce a stacked generalization ensemble learning approach to improve propensity score estimation by fitting a meta learner on the predictions of a suitably diverse set of base learners. We perform a comprehensive Monte Carlo simulation study covering a broad range of scenarios that mimic characteristics of typical data sets in educational studies. The population average treatment effect is estimated using the propensity score in Inverse Probability of Treatment Weighting (IPTW). Our proposed stacked ensembles, especially those using gradient boosting machines as a meta learner trained on the predictions of 12 base learners, reduced bias more than the current state of the art in propensity score estimation. Further, our simulations suggest that commonly used balance measures (averaged standardized absolute mean differences) might be misleading as propensity score model selection criteria. We apply our proposed model, which we call GBM-Stack, to assess the population average treatment effect of a Supplemental Instruction (SI) program in an introductory psychology (PSY 101) course at San Diego State University. Our analysis provides evidence that moving the whole population to SI attendance would on average lead to 1.69 times higher odds of passing the PSY 101 class compared to not offering SI, with a 95% bootstrap confidence interval of (1.31, 2.20).
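
As a rough illustration of the Inverse Probability of Treatment Weighting (IPTW) step mentioned in the abstract, the sketch below turns already-estimated propensity scores into average-treatment-effect weights and computes a weighted marginal odds ratio for a binary outcome such as pass/fail. The function name `iptw_odds_ratio`, the clipping bound `clip`, and the toy data are assumptions made for illustration only. This is not the authors' implementation, and the propensity scores are taken as given rather than produced by a stacked ensemble.

```python
from typing import Sequence


def iptw_odds_ratio(
    treated: Sequence[int],
    outcome: Sequence[int],
    propensity: Sequence[float],
    clip: float = 1e-6,  # hypothetical guard against scores of exactly 0 or 1
) -> float:
    """Marginal odds ratio of a binary outcome under IPTW (ATE weights)."""
    if not (len(treated) == len(outcome) == len(propensity)):
        raise ValueError("inputs must have equal length")

    # Weighted 2x2 table indexed as table[treatment][outcome].
    table = [[0.0, 0.0], [0.0, 0.0]]
    for t, y, e in zip(treated, outcome, propensity):
        e = min(max(e, clip), 1.0 - clip)
        # ATE weight: 1/e for treated units, 1/(1 - e) for controls.
        w = 1.0 / e if t == 1 else 1.0 / (1.0 - e)
        table[t][y] += w

    # Note: an empty cell makes the odds undefined (division by zero).
    odds_treated = table[1][1] / table[1][0]
    odds_control = table[0][1] / table[0][0]
    return odds_treated / odds_control


# Toy data (made up): 3 treated, 3 controls, constant propensity 0.5.
toy_or = iptw_odds_ratio(
    treated=[1, 1, 1, 0, 0, 0],
    outcome=[1, 1, 0, 1, 0, 0],
    propensity=[0.5] * 6,
)
print(toy_or)  # 4.0
```

With constant propensity scores every unit receives the same weight, so the result reduces to the unweighted odds ratio. In practice the scores would differ across units, and a confidence interval would need a resampling scheme such as the bootstrap, which this sketch does not attempt.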

How to Cite

Autenrieth, M., Levine, R. A., Fan, J., & Guarcello, M. A. (2021). Stacked Ensemble Learning for Propensity Score Methods in Observational Studies. Journal of Educational Data Mining, 13(1), 24–189.



Keywords: educational data mining, machine learning, ensemble learning, stacked generalization, propensity score estimation, causal inference

