Using a Latent Class Forest to Identify At-Risk Students in Higher Education



Published Jun 16, 2019
Kevin Pelaez, Richard Levine, Juanjuan Fan, Maureen Guarcello, Mark Laumakis


Higher education institutions often examine performance discrepancies among specific subgroups, such as students from underrepresented minority and first-generation backgrounds. Growth in educational technology and computational power has spurred research interest in using data mining tools to identify groups of students who are academically at risk. Institutions can then make data-informed decisions to promote student access, increase retention and graduation rates, and guide intervention programs. We introduce the latent class forest, an ensemble method that combines latent class analysis with random forests and recursively partitions observations into groups to help identify at-risk students. The procedure is a form of model-based hierarchical clustering that relies on latent class trees to optimally identify subgroups. We motivate and apply our latent class forest method to identify key demographic and academic characteristics of at-risk students in a large-enrollment, bottleneck introductory psychology course at San Diego State University (SDSU). We then conduct a post hoc analysis to measure the efficacy of Supplemental Instruction (SI) across these groups. SI is a peer-led academic intervention that targets historically challenging courses and aims to increase student performance. In doing so, we identify the populations that benefit most from SI, which can guide program recruitment and help increase the success rate in the introductory psychology course.
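
The recursive partitioning idea behind a latent class tree can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes binary indicators, a two-class latent class model fit by EM at each node, and a BIC comparison against the one-class model as the stopping rule. The function names (`fit_two_class_lca`, `latent_class_tree`), the `max_depth` and `min_size` parameters, and the synthetic data are all hypothetical. The forest step, which aggregates many such trees, is omitted.

```python
import math
import random

def _log_joint(x, pi_k, theta_k):
    # log P(x, class k), assuming items are independent within a class
    return math.log(pi_k) + sum(
        math.log(t) if v else math.log(1.0 - t) for v, t in zip(x, theta_k))

def _logsumexp2(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def fit_two_class_lca(X, n_iter=100, seed=0):
    """EM for a two-class latent class model on binary items (illustrative)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    pi = [0.5, 0.5]
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(2)]
    for step in range(n_iter + 1):
        # E-step: posterior class membership probabilities and log-likelihood
        log_lik, resp = 0.0, []
        for x in X:
            lj = [_log_joint(x, pi[k], theta[k]) for k in range(2)]
            norm = _logsumexp2(lj[0], lj[1])
            log_lik += norm
            resp.append([math.exp(v - norm) for v in lj])
        if step == n_iter:
            break  # the final pass only evaluates the fitted model
        # M-step: update class sizes and item-response probabilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = min(max(nk / n, 1e-6), 1 - 1e-6)
            for j in range(d):
                num = sum(r[k] * x[j] for r, x in zip(resp, X))
                theta[k][j] = min(max(num / max(nk, 1e-12), 1e-6), 1 - 1e-6)
    return log_lik, resp

def one_class_log_lik(X):
    # Log-likelihood of the no-split model: one Bernoulli rate per item
    n, d = len(X), len(X[0])
    total = 0.0
    for j in range(d):
        ones = sum(x[j] for x in X)
        p = min(max(ones / n, 1e-6), 1 - 1e-6)
        total += ones * math.log(p) + (n - ones) * math.log(1.0 - p)
    return total

def bic(log_lik, n_params, n_obs):
    return -2.0 * log_lik + n_params * math.log(n_obs)

def latent_class_tree(X, idx=None, depth=0, max_depth=2, min_size=20):
    """Recursively split a node in two while BIC prefers two classes."""
    idx = list(range(len(X))) if idx is None else idx
    n, d = len(idx), len(X[0])
    if depth >= max_depth or n < 2 * min_size:
        return [idx]
    sub = [X[i] for i in idx]
    log_lik2, resp = fit_two_class_lca(sub)
    # Two-class model has 2*d item parameters plus 1 class proportion
    if bic(log_lik2, 2 * d + 1, n) >= bic(one_class_log_lik(sub), d, n):
        return [idx]
    left = [i for i, r in zip(idx, resp) if r[0] >= 0.5]
    right = [i for i, r in zip(idx, resp) if r[0] < 0.5]
    if min(len(left), len(right)) < min_size:
        return [idx]
    return (latent_class_tree(X, left, depth + 1, max_depth, min_size)
            + latent_class_tree(X, right, depth + 1, max_depth, min_size))

# Synthetic data (hypothetical): two groups with opposite response patterns
rng = random.Random(42)
group = [0] * 200 + [1] * 200
X = [[int(rng.random() < (0.9 if g == 0 else 0.1)) for _ in range(8)]
     for g in group]
leaves = latent_class_tree(X)
print([len(leaf) for leaf in leaves])
```

Each returned leaf is a list of row indices, so characteristics of the students in a leaf can be summarized directly from the original data. Hard assignment by posterior probability and the BIC stopping rule are design choices made for brevity here, not details taken from the paper.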

How to Cite

Pelaez, K., Levine, R., Fan, J., Guarcello, M., & Laumakis, M. (2019). Using a Latent Class Forest to Identify At-Risk Students in Higher Education. Journal of Educational Data Mining, 11(1), 18–46.



Latent Class Analysis, supplemental instruction, clustering, at-risk students, Latent Class Forest

