Higher education institutions often examine performance discrepancies among specific subgroups, such as students from underrepresented minority and first-generation backgrounds. Growth in educational technology and computational power has spurred research interest in using data mining tools to identify groups of students who are academically at risk. Institutions can then make data-informed decisions to promote student access, increase retention and graduation rates, and guide intervention programs. We introduce the latent class forest, an ensemble that combines latent class analysis with random forests and recursively partitions observations into groups to help identify at-risk students. The procedure is a form of model-based hierarchical clustering that relies on latent class trees to optimally identify subgroups.
We motivate and apply our latent class forest method to identify key demographic and academic characteristics of at-risk students in a large-enrollment, bottleneck introductory psychology course at San Diego State University (SDSU). A post hoc analysis is conducted to measure the efficacy of Supplemental Instruction (SI) across these groups. SI is a peer-led academic intervention that targets historically challenging courses and aims to increase student performance. In doing so, we identify the populations that benefit most from SI, which can guide program recruitment and help increase the success rate in the introductory psychology course.
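The recursive partitioning idea behind a latent class tree can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary indicators, fits a two-class latent class model by EM in plain NumPy, and accepts a split only when BIC prefers two classes over one. The function names, the `min_size` threshold, and the synthetic data are all hypothetical, and the ensemble step that turns trees into a forest is not shown.

```python
import numpy as np

def fit_two_class_lca(X, n_iter=200, seed=0):
    """EM for a 2-class latent class model on binary indicators X (n x p)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.array([0.5, 0.5])                      # class weights
    theta = rng.uniform(0.25, 0.75, size=(2, p))   # item-response probabilities
    for _ in range(n_iter):
        # E-step: posterior class membership from per-class log-likelihoods
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
        log_norm = np.logaddexp(log_lik[:, 0], log_lik[:, 1])
        post = np.exp(log_lik - log_norm[:, None])
        # M-step: update weights and item probabilities, clipped away from 0 and 1
        pi = np.clip(post.mean(axis=0), 1e-6, 1.0)
        theta = (post.T @ X) / (post.sum(axis=0)[:, None] + 1e-12)
        theta = np.clip(theta, 1e-6, 1 - 1e-6)
    return post, float(log_norm.sum())

def one_class_loglik(X):
    """Log-likelihood of the no-split model (independent items, one class)."""
    theta = np.clip(X.mean(axis=0), 1e-6, 1 - 1e-6)
    return float((X * np.log(theta) + (1 - X) * np.log(1 - theta)).sum())

def latent_class_tree(X, idx=None, min_size=20, label="root"):
    """Recursively split rows into two latent classes while BIC improves."""
    if idx is None:
        idx = np.arange(X.shape[0])
    n, p = X[idx].shape
    if n < 2 * min_size:
        return {label: idx}
    post, ll2 = fit_two_class_lca(X[idx])
    bic1 = -2 * one_class_loglik(X[idx]) + p * np.log(n)
    bic2 = -2 * ll2 + (2 * p + 1) * np.log(n)
    hard = post.argmax(axis=1)                     # hard class assignment
    left, right = idx[hard == 0], idx[hard == 1]
    if bic2 >= bic1 or min(len(left), len(right)) < min_size:
        return {label: idx}                        # keep this node as a leaf
    leaves = latent_class_tree(X, left, min_size, label + "0")
    leaves.update(latent_class_tree(X, right, min_size, label + "1"))
    return leaves

# Synthetic example (assumption): two groups with opposite response patterns.
rng = np.random.default_rng(1)
probs = np.array([[0.9] * 6, [0.1] * 6])
truth = np.repeat([0, 1], 150)
X = (rng.random((300, 6)) < probs[truth]).astype(float)

leaves = latent_class_tree(X)
for name, members in leaves.items():
    print(name, len(members), X[members].mean(axis=0).round(2))
```

Each leaf holds the row indices of one subgroup, so group-level summaries such as course success rates with and without SI could be computed per leaf in a post hoc step.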
Keywords: latent class analysis, supplemental instruction, clustering, at-risk students, latent class forest
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.