Random forests are presented as an analytics foundation for educational data mining tasks. The focus is on course- and program-level analytics, including evaluating pedagogical approaches and interventions and identifying and characterizing at-risk students. As part of this development, the individualized treatment effect (ITE) is introduced as a way to provide personalized feedback to students. The ITE quantifies the effectiveness of an intervention or instructional regime for a particular student, based on institutional student information and performance data. The proposed random forest framework and methods are illustrated with a study of the efficacy of a supplemental, weekly, one-unit problem-solving session in a large-enrollment, bottleneck introductory statistics course. The analytics tools are used to identify factors for student success, characterize the benefits of a supplemental instruction section, and suggest intervention initiatives for at-risk groups in the course. In particular, we develop an objective criterion to determine which students should be encouraged, at the beginning of the semester, to join a supplemental instruction section.
Keywords: random forest, individualized treatment effect, problem solving, statistics education
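The abstract does not spell out how the ITE is computed, so the following is only a minimal sketch of one common approach, not the authors' procedure: fit a random forest to outcomes using student covariates plus a treatment indicator, then take the difference between a student's predicted outcome with and without the intervention. Everything here is an assumption made for illustration: the synthetic data, the feature names (`gpa`, `units`, `attended`), the built-in benefit for students with a GPA below 3.0, and the small standard-library forest itself.

```python
import random
from statistics import fmean

def sse(values):
    """Sum of squared deviations from the mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

def grow_tree(rows, depth, rng, mtry=2, min_leaf=5):
    """Grow one regression tree; rows are (features, outcome) pairs."""
    outcomes = [y for _, y in rows]
    if depth == 0 or len(rows) < 2 * min_leaf:
        return fmean(outcomes)  # leaf: mean outcome of the node
    best = None
    # Random forest step: only a random subset of features competes per split.
    for j in rng.sample(range(len(rows[0][0])), mtry):
        # Try a few observed values as cut points (keeps the toy example fast).
        for x, _ in rng.sample(rows, min(8, len(rows))):
            left = [y for f, y in rows if f[j] <= x[j]]
            right = [y for f, y in rows if f[j] > x[j]]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, x[j])
    if best is None:
        return fmean(outcomes)  # no valid split found: make a leaf
    _, j, cut = best
    left_rows = [r for r in rows if r[0][j] <= cut]
    right_rows = [r for r in rows if r[0][j] > cut]
    return (j, cut,
            grow_tree(left_rows, depth - 1, rng, mtry, min_leaf),
            grow_tree(right_rows, depth - 1, rng, mtry, min_leaf))

def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        j, cut, left, right = node
        node = left if x[j] <= cut else right
    return node

def fit_forest(rows, n_trees=40, depth=4, seed=0):
    """Fit each tree on a bootstrap resample of the rows."""
    rng = random.Random(seed)
    return [grow_tree(rng.choices(rows, k=len(rows)), depth, rng)
            for _ in range(n_trees)]

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return fmean([predict_tree(tree, x) for tree in forest])

def ite(forest, features):
    """Predicted outcome with the intervention minus without it."""
    gpa, units, _ = features  # the observed treatment flag is ignored
    return (predict_forest(forest, (gpa, units, 1))
            - predict_forest(forest, (gpa, units, 0)))

# Synthetic students (hypothetical): attendance is assigned at random and
# helps only students whose GPA is below 3.0.
data_rng = random.Random(1)
students = []
for _ in range(300):
    gpa = data_rng.uniform(2.0, 4.0)
    units = data_rng.randint(12, 18)
    attended = data_rng.randint(0, 1)
    benefit = 10.0 if gpa < 3.0 else 0.0
    score = 50 + 10 * gpa + benefit * attended + data_rng.gauss(0, 2)
    students.append(((gpa, units, attended), score))

forest = fit_forest(students)
effects = [(f[0], ite(forest, f)) for f, _ in students]
low_gain = fmean([e for g, e in effects if g < 3.0])
high_gain = fmean([e for g, e in effects if g >= 3.0])
print(f"mean estimated ITE, GPA < 3.0:  {low_gain:.1f}")
print(f"mean estimated ITE, GPA >= 3.0: {high_gain:.1f}")
```

In this sketch, attendance is randomized, so the difference in predictions can be read as an effect. With observational course data, where students choose whether to enroll, the same difference would also reflect differences between the students who enroll and those who do not, unless the covariates account for them.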