Researchers use many different metrics to evaluate the performance of student models. The aim of this paper is to provide an overview of commonly used metrics, to discuss the properties, advantages, and disadvantages of different metrics, to summarize current practice in educational data mining, and to provide guidance for the evaluation of student models. In the discussion we consider the relation of metrics to parameter fitting and the impact of student models on student practice (over-practice, under-practice), and we point out connections to related work on the evaluation of probability forecasters in other domains. We also provide an empirical comparison of metrics. One of the conclusions of the paper is that some commonly used metrics should not be used (MAE) or should be used more critically (AUC).
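To make the metrics concrete, the following minimal sketch (not taken from the paper) computes MAE and AUC, which the abstract mentions, together with RMSE and log-likelihood as further examples of commonly used prediction-accuracy metrics, from a hypothetical set of predicted success probabilities and observed binary answers. All function names and data below are illustrative assumptions, not code from the paper.

```python
import math

def mae(preds, outcomes):
    """Mean absolute error between predicted probabilities and 0/1 outcomes."""
    return sum(abs(p - o) for p, o in zip(preds, outcomes)) / len(preds)

def rmse(preds, outcomes):
    """Root mean squared error (the square root of the Brier score)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds))

def log_likelihood(preds, outcomes):
    """Sum of log p for correct answers and log(1 - p) for incorrect ones.
    Assumes predictions are strictly between 0 and 1."""
    return sum(math.log(p) if o == 1 else math.log(1 - p)
               for p, o in zip(preds, outcomes))

def auc(preds, outcomes):
    """Probability that a randomly chosen correct answer gets a higher
    prediction than a randomly chosen incorrect one (ties count as 1/2)."""
    pos = [p for p, o in zip(preds, outcomes) if o == 1]
    neg = [p for p, o in zip(preds, outcomes) if o == 0]
    pairs = [1.0 if pp > pn else 0.5 if pp == pn else 0.0
             for pp in pos for pn in neg]
    return sum(pairs) / len(pairs)

# Hypothetical student model predictions and observed answers (1 = correct).
predictions = [0.9, 0.7, 0.6, 0.4, 0.3]
answers = [1, 1, 0, 1, 0]
print(f"MAE  = {mae(predictions, answers):.3f}")
print(f"RMSE = {rmse(predictions, answers):.3f}")
print(f"LL   = {log_likelihood(predictions, answers):.3f}")
print(f"AUC  = {auc(predictions, answers):.3f}")
```

Note that MAE, RMSE, and log-likelihood compare predicted probabilities with outcomes directly, whereas AUC depends only on the ordering of predictions, which is one reason the metrics can rank models differently.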