Researchers use many different metrics to evaluate the performance of student models. The aim of this paper is to provide an overview of commonly used metrics, to discuss the properties, advantages, and disadvantages of different metrics, to summarize current practice in educational data mining, and to provide guidance for the evaluation of student models. In the discussion we consider the relation of metrics to parameter fitting and the impact of student models on student practice (over-practice, under-practice), and we point out connections to related work on the evaluation of probability forecasters in other domains. We also provide an empirical comparison of metrics. One of the conclusions of the paper is that some commonly used metrics should not be used at all (MAE), while others should be used more critically (AUC).
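To make the metrics under discussion concrete, here is a minimal sketch of how several of them (MAE, RMSE, average log-likelihood, AUC) can be computed for probabilistic predictions of binary student responses. All function names and sample data are illustrative assumptions, not taken from the paper:

```python
# Illustrative computation of metrics commonly used to evaluate
# probabilistic student models. Predictions are probabilities of a
# correct answer; outcomes are observed 0/1 responses.
import math

def mae(preds, outcomes):
    # Mean absolute error between predicted probabilities and 0/1 outcomes.
    return sum(abs(p - o) for p, o in zip(preds, outcomes)) / len(preds)

def rmse(preds, outcomes):
    # Root mean squared error; for binary outcomes this is the square
    # root of the Brier score.
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds))

def avg_log_likelihood(preds, outcomes):
    # Average log-likelihood of the observed outcomes under the predictions.
    return sum(math.log(p if o == 1 else 1 - p)
               for p, o in zip(preds, outcomes)) / len(preds)

def auc(preds, outcomes):
    # Probability that a randomly chosen positive example receives a
    # higher prediction than a randomly chosen negative one
    # (Mann-Whitney formulation; ties count as 1/2).
    pos = [p for p, o in zip(preds, outcomes) if o == 1]
    neg = [p for p, o in zip(preds, outcomes) if o == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical sample data: four predictions and observed answers.
preds = [0.9, 0.7, 0.4, 0.2]
outcomes = [1, 1, 0, 0]
print(mae(preds, outcomes))   # 0.25
print(auc(preds, outcomes))   # 1.0 (every positive outranks every negative)
```

Note that MAE and RMSE compare probabilities to outcomes directly, while AUC depends only on the ranking of the predictions; this difference in what each metric is sensitive to is part of what the paper's discussion addresses.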
Keywords: metrics for evaluation, performance of student models, parameter fitting, student practice, probability forecasters