Applying Psychometric Modeling to aid Feature Engineering in Predictive Log-Data Analytics: The NAEP EDM Competition



Published Aug 26, 2021
Fabian Zehner, Beate Eichmann, Tobias Deribo, Scott Harrison, Daniel Bengs, Nico Andersen, Carolin Hahnel


The NAEP EDM Competition required participants to predict efficient test-taking behavior based on log data. This paper describes our top-down approach to engineering features by means of psychometric modeling, with the aim of using them in machine learning for the predictive classification task. For feature engineering, we employed, among others, the Log-Normal Response Time Model for estimating latent person speed and the Generalized Partial Credit Model for estimating latent person ability. Additionally, we adopted an n-gram feature approach for event sequences. Furthermore, instead of using the provided binary target label, we distinguished between inefficient test takers who were going too fast and those who were going too slow in order to train a multi-label classifier. Our best-performing ensemble classifier comprised three sets of low-dimensional classifiers, dominated by test-taker speed. While our classifier reached moderate performance relative to the competition leaderboard, our approach makes two important contributions. First, we show how classifiers that contain features engineered through literature-derived domain knowledge can provide meaningful predictions if results can be contextualized for test administrators who wish to intervene or take action. Second, our re-engineering of test scores enabled us to incorporate person ability into the models. However, ability was hardly predictive of efficient behavior, leading to the conclusion that the target label's validity needs to be questioned. Beyond competition-related findings, we also report a state sequence analysis to demonstrate the viability of the employed tools. This analysis yielded four test-taking types that captured distinctive differences between test takers, with relevant implications for assessment practice.
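To illustrate what an n-gram feature approach for event sequences can look like, the following is a minimal sketch in Python. It is not the authors' implementation: the function name `ngram_counts`, the `" > "` separator, and the event names in the example log are hypothetical and chosen only for illustration. The sketch counts contiguous runs of n events in one test taker's log, so that each distinct n-gram could serve as one count feature.

```python
from collections import Counter


def ngram_counts(events: list[str], n: int = 2) -> Counter:
    """Count contiguous n-grams of event names in a single event sequence."""
    # Build n shifted views of the sequence and zip them into n-tuples.
    grams = zip(*(events[i:] for i in range(n)))
    # Join each tuple into one string key (separator is an arbitrary choice).
    return Counter(" > ".join(gram) for gram in grams)


# Hypothetical event log for one test taker (event names are invented).
log = ["enter_item", "click_choice", "open_calculator", "click_choice", "next"]
bigrams = ngram_counts(log, n=2)
trigrams = ngram_counts(log, n=3)
```

In a setting like the one described, such counts would typically be computed per test taker and assembled into a feature matrix, possibly after dropping very rare n-grams; those steps are omitted here.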

How to Cite

Zehner, F., Eichmann, B., Deribo, T., Harrison, S., Bengs, D., Andersen, N., & Hahnel, C. (2021). Applying Psychometric Modeling to aid Feature Engineering in Predictive Log-Data Analytics: The NAEP EDM Competition. Journal of Educational Data Mining, 13(2), 80–107.



log files, psychometric models, domain knowledge–based feature engineering, process data, state sequence analysis, clustering, latent state, ensemble

Scientific Findings from the NAEP 2019 Data Mining Competition