ClickTree: A Tree-based Method for Predicting Math Students’ Performance Based on Clickstream Data
Abstract
The prediction of student performance and the analysis of students’ learning behaviour play an important role in enhancing online courses. By analysing a massive amount of clickstream data that captures student behaviour, educators can gain valuable insights into the factors that influence students’ academic outcomes and identify areas for improvement in courses. In this study, we developed ClickTree, a tree-based methodology, to predict student performance on mathematical problems in end-unit assignments based on students’ clickstream data. Utilising extensive clickstream data, we extracted a novel set of features at three levels (problem-level, assignment-level and student-level) and trained a CatBoost tree to predict whether a student will successfully answer a problem in an end-unit assignment. The developed method achieved an Area Under the ROC Curve (AUC) of approximately 79% in the Educational Data Mining Cup 2023 and ranked second in the competition. Our results indicate that students who performed well in end-unit assignment problems engaged more with in-unit assignments and answered more problems correctly, while those who struggled had higher tutoring request rates. We also found that students face more difficulties with “check all that apply” types of problems. Moreover, Algebra II was the most difficult subject for students. The proposed method can be utilised to improve students’ learning experiences, and the insights from this study can be integrated into mathematics courses to enhance students’ learning outcomes. The code and implementation are available at https://www.kaggle.com/code/nargesrohani/clicktree/notebook.
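As a rough illustration of the two ideas the abstract leans on, the sketch below aggregates a toy clickstream log at the student, assignment and problem levels and computes an AUC from predicted scores. It is not the ClickTree implementation: the log layout, the action names (`correct_response`, `wrong_response`, `tutoring_requested`) and the helper names `rate_by` and `auc` are all assumptions made for this example, and the CatBoost training step is omitted entirely.

```python
from collections import defaultdict

# Hypothetical toy log: (student_id, assignment_id, problem_id, action).
# The column layout and action names are assumptions, not the competition schema.
LOG = [
    ("s1", "a1", "p1", "correct_response"),
    ("s1", "a1", "p2", "tutoring_requested"),
    ("s1", "a1", "p2", "wrong_response"),
    ("s2", "a1", "p1", "wrong_response"),
    ("s2", "a1", "p1", "tutoring_requested"),
    ("s2", "a2", "p3", "correct_response"),
]


def rate_by(log, key_index, action):
    """Fraction of events equal to `action`, grouped by one column of the log."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in log:
        key = row[key_index]
        totals[key] += 1
        hits[key] += row[3] == action  # True counts as 1
    return {key: hits[key] / totals[key] for key in totals}


def auc(labels, scores):
    """AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# One illustrative feature per level; a real feature set would be far richer.
student_tutoring = rate_by(LOG, 0, "tutoring_requested")    # student-level
assignment_correct = rate_by(LOG, 1, "correct_response")    # assignment-level
problem_correct = rate_by(LOG, 2, "correct_response")       # problem-level

# Made-up labels and scores, only to exercise the metric.
example_auc = auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
```

In a fuller pipeline, the per-level dictionaries would be joined back onto each (student, problem) row to form the feature matrix passed to a gradient-boosted tree model, and the AUC would be computed on held-out predictions.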
Keywords: student performance prediction, educational data mining, mathematics, learning behaviour, learning analytics
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.