Empirical Evaluation of Deep Learning Models for Knowledge Tracing: Of Hyperparameters and Metrics on Performance and Replicability

Published Oct 1, 2022
Sami Sarsa, Juho Leinonen, Arto Hellas

Abstract

New knowledge tracing models are continuously being proposed, even at a pace where state-of-the-art models cannot be compared with each other at the time of publication. This leads to a situation where ranking models is hard, and the underlying reasons for the models' performance (be it architectural choices, hyperparameter tuning, performance metrics, or data) are often underexplored. In this work, we review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely used data sets, and with a novel data set of students learning to program. The evaluated knowledge tracing models include Vanilla-DKT, two Long Short-Term Memory Deep Knowledge Tracing (LSTM-DKT) variants, two Dynamic Key-Value Memory Network (DKVMN) variants, and Self-Attentive Knowledge Tracing (SAKT). As baselines, we evaluate simple non-learning models, logistic regression, and Bayesian Knowledge Tracing (BKT). To evaluate how different aspects of DLKT models influence model performance, we test input and output layer variations found in the compared models that are independent of the main architectures. We study maximum attempt count options, including filtering out long attempt sequences, that have been implicitly and explicitly used in prior studies. We contrast the observed performance variations against variations from non-model properties such as randomness and hardware. Performance of models is assessed using multiple metrics, whereby we also contrast the impact of the choice of metric on model performance. The key contributions of this work are the following: evidence that DLKT models generally outperform more traditional models, but not necessarily by much and not always; evidence that even simple baselines with little to no predictive value may outperform DLKT models, especially in terms of accuracy, which highlights the importance of selecting proper baselines for comparison; disambiguation of properties that lead to better performance in DLKT models, including metric choice, input and output layer variations, common hyperparameters, random seeding, and hardware; and discussion of issues in replicability when evaluating DLKT models, including discrepancies in prior reported results and methodology. Model implementations, evaluation code, and data are published as a part of this work.
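The point about accuracy and baselines can be illustrated with a small sketch. The data below is invented for illustration and is not taken from the study; the function names `accuracy` and `auc` and the 0.5 decision threshold are likewise assumptions. It shows how a constant predictor with no ability to rank students can score higher accuracy than a model that ranks every attempt correctly, while AUC tells the opposite story.

```python
# Illustrative sketch only: toy data, not results from the paper.
# Labels: 1 = correct attempt, 0 = incorrect attempt (80% correct overall).
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Hypothetical baseline: always predict the mean correctness rate.
baseline_scores = [0.8] * 10

# Hypothetical model: ranks every correct attempt above every incorrect one,
# but several of its scores for correct attempts fall below 0.5.
model_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.35, 0.3, 0.2]


def accuracy(y_true, scores, threshold=0.5):
    """Fraction of attempts whose thresholded score matches the label."""
    hits = sum(int(s >= threshold) == y for y, s in zip(y_true, scores))
    return hits / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


for name, scores in [("baseline", baseline_scores), ("model", model_scores)]:
    print(name, "accuracy:", accuracy(labels, scores), "AUC:", auc(labels, scores))
```

On this toy data the constant baseline reaches higher accuracy than the model even though its AUC sits at chance level, which is one way the choice of metric can change how models rank against each other.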

How to Cite

Sarsa, S., Leinonen, J., & Hellas, A. (2022). Empirical Evaluation of Deep Learning Models for Knowledge Tracing: Of Hyperparameters and Metrics on Performance and Replicability. Journal of Educational Data Mining, 14(2). https://doi.org/10.5281/zenodo.7086179

Keywords

knowledge tracing, deep learning, memory networks, attention-based models, hyperparameter optimization, evaluation metrics, replicability

Section
(past) EDM 2022 Journal Track