Metrics for Evaluation of Student Models



Published Jun 2, 2015
Radek Pelánek


Researchers use many different metrics to evaluate the performance of student models. The aim of this paper is to provide an overview of commonly used metrics, to discuss the properties, advantages, and disadvantages of different metrics, to summarize current practice in educational data mining, and to provide guidance for the evaluation of student models. In the discussion we address the relation of metrics to parameter fitting and the impact of student models on student practice (over-practice, under-practice), and we point out connections to related work on the evaluation of probability forecasters in other domains. We also provide an empirical comparison of metrics. One of the conclusions of the paper is that some commonly used metrics should not be used (MAE) or should be used more critically (AUC).
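The caution about MAE can be illustrated with a minimal sketch. It is not taken from the paper; it assumes a single hypothetical item that students answer correctly with a known probability of 0.7, and the function names are illustrative. Under that assumption, the expected absolute error is lowest for the extreme prediction 1.0, while the expected squared error (the quantity underlying RMSE) is lowest for the true probability.

```python
def expected_abs_error(p_true, q):
    """Expected |outcome - q| when the outcome is 1 with probability p_true."""
    return p_true * abs(1 - q) + (1 - p_true) * abs(0 - q)


def expected_sq_error(p_true, q):
    """Expected (outcome - q)^2 when the outcome is 1 with probability p_true."""
    return p_true * (1 - q) ** 2 + (1 - p_true) * q ** 2


# Hypothetical true probability of a correct answer (an assumption for this sketch).
p_true = 0.7

# Candidate predictions 0.00, 0.01, ..., 1.00.
candidates = [i / 100 for i in range(101)]

# Prediction that minimizes each expected error.
best_for_mae = min(candidates, key=lambda q: expected_abs_error(p_true, q))
best_for_rmse = min(candidates, key=lambda q: expected_sq_error(p_true, q))

print("MAE-optimal prediction:", best_for_mae)
print("RMSE-optimal prediction:", best_for_rmse)
```

In this sketch a model tuned for MAE would be rewarded for reporting 1.0 rather than 0.7, which is one way to see why MAE is a questionable choice when the predictions are meant to be probabilities.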

How to Cite

Pelánek, R. (2015). Metrics for Evaluation of Student Models. Journal of Educational Data Mining, 7(2), 1–19.



metrics for evaluation, performance of student models, parameter fitting, student practice, probability forecasters

EDM 2015 Journal Track