ClickTree: A Tree-based Method for Predicting Math Students’ Performance Based on Clickstream Data


Published Oct 17, 2024
Narjes Rohani, Behnam Rohani, Areti Manataki

Abstract

The prediction of student performance and the analysis of students’ learning behaviour play an important role in enhancing online courses. By analysing large volumes of clickstream data that capture student behaviour, educators can gain valuable insights into the factors that influence students’ academic outcomes and identify areas of improvement in courses. In this study, we developed ClickTree, a tree-based methodology, to predict student performance on mathematical problems in end-unit assignments based on students’ clickstream data. Utilising extensive clickstream data, we extracted a novel set of features at three levels (problem-level, assignment-level and student-level) and trained a CatBoost tree to predict whether a student will successfully answer a problem in an end-unit assignment. The developed method achieved an Area under the ROC Curve (AUC) of approximately 79% in the Educational Data Mining Cup 2023 and ranked second in the competition. Our results indicate that students who performed well in end-unit assignment problems engaged more with in-unit assignments and answered more problems correctly, while those who struggled had higher tutoring request rates. We also found that students face more difficulties with “check all that apply” types of problems. Moreover, Algebra II was the most difficult subject for students. The proposed method can be utilised to improve students’ learning experiences, and the insights from this study can be integrated into mathematics courses to enhance students’ learning outcomes. The code and implementation are available at https://www.kaggle.com/code/nargesrohani/clicktree/notebook.
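As a rough illustration of the ideas summarised above, the following minimal sketch aggregates a toy clickstream log at the student, assignment and problem levels, and computes an AUC from labels and predicted scores. It is not the authors' implementation: the log layout, the action labels, and the helper names `rate_by` and `auc` are assumptions made for this example, and no CatBoost model is trained here.

```python
from collections import defaultdict

# Hypothetical toy clickstream log: (student, assignment, problem, action).
# Field layout and action labels are illustrative assumptions,
# not the schema of the EDM Cup 2023 dataset.
LOG = [
    ("s1", "a1", "p1", "correct"),
    ("s1", "a1", "p2", "tutoring_request"),
    ("s1", "a1", "p2", "wrong"),
    ("s2", "a1", "p1", "correct"),
    ("s2", "a1", "p2", "correct"),
]


def rate_by(log, key_index, action):
    """Share of logged actions equal to `action`, grouped by one column of the log."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in log:
        key = row[key_index]
        totals[key] += 1
        hits[key] += row[3] == action
    return {key: hits[key] / totals[key] for key in totals}


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


student_tutoring = rate_by(LOG, 0, "tutoring_request")  # student-level feature
assignment_correct = rate_by(LOG, 1, "correct")         # assignment-level feature
problem_correct = rate_by(LOG, 2, "correct")            # problem-level feature
```

In a full pipeline, features like these would be joined into one row per (student, problem) pair and passed to a gradient-boosted tree classifier; the pairwise `auc` above is quadratic in the number of examples and is meant only to make the metric concrete.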

How to Cite

Rohani, N., Rohani, B., & Manataki, A. (2024). ClickTree: A Tree-based Method for Predicting Math Students’ Performance Based on Clickstream Data. Journal of Educational Data Mining, 16(2), 32–57. https://doi.org/10.5281/zenodo.13627655


Keywords

student performance prediction, educational data mining, mathematics, learning behaviour, learning analytics

Section
Special Section EDM Cup 2023