Predictive and Explanatory Models Might Miss Informative Features in Educational Data


Published Dec 28, 2021
Nicholas T. Young, Marcos D. Caballero

Abstract

Variables with little variation are common in educational data mining (EDM), both because of the demographics of higher education and because of the questions we ask. Yet little work has examined how to analyze such data. We therefore conducted a simulation study using logistic regression, penalized regression, and random forest, systematically varying the fraction of positive outcomes, the feature imbalances, and the odds ratios. We find that the algorithms treat features with the same odds ratio differently depending on the feature's imbalance and the outcome imbalance. While none of the algorithms fully solved the problem of imbalanced data, penalized approaches such as Firth and Log-F reduced the difference between the built-in odds ratio and the value estimated by the algorithm. Our results suggest that EDM studies might contain false negatives when determining which variables are related to an outcome. We then apply our findings to a graduate admissions dataset. We close with two recommendations: researchers should consider penalized regression for datasets on the order of hundreds of cases, and publications should report more context about the data, such as the outcome and feature imbalances.
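The abstract's core comparison, a feature's built-in odds ratio versus the value recovered from the data, can be sketched with a minimal simulation. The Python function below is an illustrative sketch, not the authors' simulation design: it draws a binary feature with a chosen imbalance `p_x`, generates outcomes from a logistic model whose built-in odds ratio is `exp(beta1)`, and estimates the odds ratio from the resulting 2x2 table. With a small sample and a rare feature, the sparse cells make the estimate unstable, which is the sparse-data bias that penalized methods such as Firth's aim to reduce.

```python
import math
import random

def simulate_odds_ratio(n, p_x, beta0, beta1, seed=0):
    """Estimate the odds ratio of a simulated binary feature.

    The feature X is Bernoulli(p_x); the outcome Y follows a logistic
    model P(Y=1|X) = sigmoid(beta0 + beta1*X), so the built-in odds
    ratio is exp(beta1).  (Illustrative only; parameter names are ours.)
    """
    rng = random.Random(seed)
    # 2x2 cell counts: a=(x=1,y=1), b=(x=1,y=0), c=(x=0,y=1), d=(x=0,y=0)
    a = b = c = d = 0
    for _ in range(n):
        x = 1 if rng.random() < p_x else 0
        p_y = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        y = 1 if rng.random() < p_y else 0
        if x and y:
            a += 1
        elif x:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    # Haldane-Anscombe correction guards against zero cells in sparse tables
    if min(a, b, c, d) == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

# Large sample, moderate feature imbalance: estimate sits near the built-in OR of 3
print(simulate_odds_ratio(200_000, 0.3, -1.0, math.log(3.0)))

# Small sample, rare feature: sparse cells make the estimate unstable
print(simulate_odds_ratio(200, 0.05, -1.0, math.log(3.0)))
```

Repeating the small-sample call across seeds shows the estimate swinging far from 3, the situation in which the paper finds standard logistic regression least trustworthy.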

How to Cite

Young, N. T., & Caballero, M. D. (2021). Predictive and Explanatory Models Might Miss Informative Features in Educational Data. Journal of Educational Data Mining, 13(4), 31–86. https://doi.org/10.5281/zenodo.5806830


Keywords

random forest, penalized regression, feature imbalance, outcome imbalance
