Metrics for Evaluation of Student Models



Published Jun 2, 2015
Radek Pelánek



Researchers use many different metrics to evaluate the performance of student models. The aim of this paper is to provide an overview of commonly used metrics, to discuss the properties, advantages, and disadvantages of each, to summarize current practice in educational data mining, and to provide guidance for the evaluation of student models. In the discussion we address the relation of metrics to parameter fitting and the impact of student models on student practice (over-practice, under-practice), and we point out connections to related work on the evaluation of probability forecasters in other domains. We also provide an empirical comparison of metrics. One of the conclusions of the paper is that some commonly used metrics should not be used (MAE) and others should be used more critically (AUC).
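The concerns about MAE and AUC can be illustrated with a small sketch. It is not taken from the paper: the data are hypothetical, the function names are chosen for illustration, and RMSE is included only as a contrasting metric. The first part shows that, for answers that are correct 70% of the time, MAE rewards an overconfident prediction of 1.0 over the honest prediction of 0.7, while RMSE prefers the honest one. The second part shows that AUC depends only on the ranking of predictions, so it does not change when all predicted probabilities are shrunk by a constant factor.

```python
import math


def mae(preds, outcomes):
    """Mean absolute error between predicted probabilities and 0/1 outcomes."""
    return sum(abs(p - y) for p, y in zip(preds, outcomes)) / len(outcomes)


def rmse(preds, outcomes):
    """Root mean square error between predicted probabilities and 0/1 outcomes."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(outcomes))


def auc(preds, outcomes):
    """Fraction of (correct, incorrect) pairs ranked in the right order; ties count as 0.5."""
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 10 answers, 7 of them correct.
outcomes = [1] * 7 + [0] * 3
honest = [0.7] * 10    # matches the observed success rate
extreme = [1.0] * 10   # overconfident: always predicts a correct answer

print("MAE  honest:", mae(honest, outcomes), " extreme:", mae(extreme, outcomes))
print("RMSE honest:", rmse(honest, outcomes), " extreme:", rmse(extreme, outcomes))

# Hypothetical ranked predictions, and the same predictions shrunk tenfold.
answers = [1, 1, 0, 1, 0]
ranked = [0.9, 0.8, 0.7, 0.6, 0.2]
shrunk = [p * 0.1 for p in ranked]  # same ordering, very different probabilities

print("AUC ranked:", auc(ranked, answers), " shrunk:", auc(shrunk, answers))
```

On these data MAE is lower for the overconfident predictions, RMSE is lower for the honest ones, and the two AUC values are identical even though the shrunk predictions are poorly calibrated.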

How to Cite

Pelánek, R. (2015). Metrics for Evaluation of Student Models. JEDM | Journal of Educational Data Mining, 7(2), 1-19. Retrieved from


EDM 2015 Journal Track