Data Plus Theory Equals Codebook: Leveraging LLMs for Human-AI Codebook Development


Published January 24, 2026
Andres Felipe Zambrano, Zhanlan Wei, Jiayi Zhang, Ryan S. Baker, Jaclyn Ocumpaugh, Amanda Barany, Xiner Liu, Yiqiu Zhou, Luc Paquette, Jeffrey Ginger, Conrad Borchers

Abstract

Recent research has explored the use of Large Language Models (LLMs) to develop qualitative codebooks, mainly for inductive work with large datasets, where manual review is impractical. Although these efforts show promise, they often neglect the theoretical grounding essential to many types of qualitative analysis. This paper investigates the potential of GPT-4o to support theory-informed codebook development across two educational contexts. In the first study, we employ a three-step approach—drawing on Winne & Hadwin’s and Zimmerman’s Self-Regulated Learning (SRL) theories, think-aloud data, and human refinement—to evaluate GPT-4o’s ability to generate high-quality, theory-aligned codebooks. Results indicate that GPT-4o can effectively leverage its knowledge base to identify SRL constructs reflected in student problem-solving behavior. In the second study, we extend this approach to a STEM game-based learning context guided by Hidi & Renninger’s four-phase model of Interest Development. We compare four prompting strategies: no theories provided, theories named, full references given, and full-text theory papers supplied. Human evaluations show that naming the theory without including full references produced the most practical and usable codebook, while supplying full papers in the prompt enhanced theoretical alignment but reduced applicability. These findings suggest that GPT-4o can be a valuable partner in theory-driven qualitative research when grounded in well-established frameworks, but that careful attention to prompt design is required. Our results show that widely available foundation models—trained on large-scale open web and licensed datasets—can effectively distill established educational theories to support qualitative research and codebook development. The code for our codebook development process, along with all prompts employed and codebooks produced by GPT-4o, is available for replication purposes at: https://osf.io/g3z4x

How to Cite

Data Plus Theory Equals Codebook: Leveraging LLMs for Human-AI Codebook Development. (2026). Journal of Educational Data Mining, 18(1), 25-65. https://doi.org/10.5281/zenodo.18352290

Details

Keywords

Large Language Models, Qualitative Codebooks, Interest Development, Self-Regulated Learning, Thematic Analysis, Codebook Development

References
Azevedo, F. S. 2011. Lines of practice: A practice-centered theory of interest relationships. Cognition and Instruction, 29(2), 147–184.

Baker, R. S., Hutt, S., Bosch, N., Ocumpaugh, J., Biswas, G., Paquette, L., Andres, J. M. A., Nasiar, N., and Munshi, A. 2024. Detector-driven classroom interviewing: Focusing qualitative researcher time by selecting cases in situ. Educational Technology Research and Development, 72, 2841–2863.

Bannert, M., Reimann, P., and Sonnenberg, C. 2014. Process mining techniques for analysing patterns and strategies in students’ self-regulated learning. Metacognition and Learning, 9, 161–185.

Barany, A., Nasiar, N., Porter, C., Zambrano, A. F., Andres, A., Bright, D., Choi, J., Gao, S., Giordano, C., Liu, X., Mehta, S., Shah, M., Zhang, J., and Baker, R. S. 2024. ChatGPT for education research: Exploring the potential of large language models for qualitative codebook development. In Proceedings of the International Conference on Artificial Intelligence in Education (Vol. 14830, pp. 134–149). Springer.

Bialik, M., Zhan, K., and Reich, J. 2025. Who coded it better? Exploring AI-assisted qualitative analysis through researcher reactions. In A. Barany, R. S. Baker, A. Katz, & J. Lin (Eds.), From data to discovery: LLMs for qualitative analysis in education (LAK ’25 Workshop). Dublin, Ireland.

Bingham, A. J., and Witkowsky, P. 2021. Deductive and inductive approaches to qualitative data analysis. In Analyzing and interpreting qualitative data: After the interview (pp. 133–146).

Blumer, H. 1954. The nature of race prejudice.

Borchers, C., Zhang, J., Baker, R. S., and Aleven, V. 2024. Using think-aloud data to understand relations between self-regulation cycle characteristics and student performance in intelligent tutoring systems. In Proceedings of the 14th Learning Analytics and Knowledge Conference (LAK ’24) (pp. 529–539). ACM.

Borchers, C., Shahrokhian, B., Balzan, F., Tajik, E., Sankaranarayanan, S., and Simon, S. 2025. Temperature and persona shape LLM agent consensus with minimal accuracy gains in qualitative coding.

Braun, V., and Clarke, V. 2012. Thematic analysis. American Psychological Association.

Charmaz, K. 1983. Loss of self: A fundamental form of suffering in the chronically ill. Sociology of Health & Illness, 5, 168–195.

Charmaz, K. 2006. Constructing grounded theory: A practical guide through qualitative analysis. Sage.

Chen, J., Lotsos, A., Wang, G., Zhao, L., Sherin, B., Wilensky, U., and Horn, M. 2025. Processes matter: How ML/GAI approaches could support open qualitative coding of online discourse datasets. In Proceedings of the 18th International Conference on Computer-Supported Collaborative Learning (pp. 415–419). ISLS.

Chew, R., Bollenbacher, J., Wenger, M., Speer, J., and Kim, A. 2023. LLM-assisted content analysis: Using large language models to support deductive coding.

Corbin, J. M., and Strauss, A. 1990. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13, 3–21.

De Paoli, S. 2024. Performing an inductive thematic analysis of semi-structured interviews with a large language model. Social Science Computer Review, 42, 997–1019.

Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R., and Ahmed, N. K. 2024. Bias and fairness in large language models: A survey. Computational Linguistics, 50, 1097–1179.

Gao, J., Guo, Y., Lim, G., Zhang, T., Zhang, Z., Li, T. J.-J., and Perrault, S. T. 2024. CollabCoder: A lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Article 11, pp. 1–29). ACM. https://doi.org/10.1145/3613904.3642002

Gibbs, G. R. 2018. The nature of qualitative analysis. In Analyzing qualitative data (2nd ed., pp. 1–16).

Giray, L. 2023. Prompt engineering with ChatGPT: A guide for academic writers. Annals of Biomedical Engineering, 51, 2629–2633.

Greene, J. A., Bernacki, M. L., and Hadwin, A. F. 2023. Self-regulation. In Handbook of educational psychology (pp. 314–334).

Guest, G., Bunce, A., and Johnson, L. 2006. How many interviews are enough? Field Methods, 18, 59–82.

Hidi, S., and Renninger, K. A. 2006. The four-phase model of interest development. Educational Psychologist, 41, 111–127.

Hutt, S., Ocumpaugh, J., Ma, J., Andres, A. L., Bosch, N., Paquette, L., Biswas, G., and Baker, R. S. 2021. Investigating SMART models of self-regulation and their impact on learning. In Proceedings of the 14th International Conference on Educational Data Mining (pp. 580–587).

Hutt, S., Baker, R. S., Ocumpaugh, J., Munshi, A., Andres, J., Karumbaiah, S., Slater, S., Biswas, G., Paquette, L., Bosch, N., and van Velsen, M. 2022. Quick red fox: An app supporting a new paradigm in qualitative research on AIED for STEM. In Artificial intelligence in STEM education (pp. 319–332).

Irgens, G. A., Adisa, I. O., Sistla, D., Famaye, T., Bailey, C., Behboudi, A., and Adefisayo, A. O. 2024. Supporting theory building in design-based research through large-scale data-based models. In Proceedings of the 17th International Conference on Educational Data Mining (pp. 296–303).

Jiang, Y., Wang, W., and Xu, Y. 2025. Collaborative coding and debriefing with GPT-4o: Enhancing analytic rigor through dialogue. In A. Barany, R. S. Baker, A. Katz, & J. Lin (Eds.), From data to discovery: LLMs for qualitative analysis in education (LAK ’25 Workshop). Dublin, Ireland.

Katz, A., Gerhardt, M., and Soledad, M. 2024. Using generative text models to create qualitative codebooks for student evaluations of teaching. International Journal of Qualitative Methods, 23. https://doi.org/10.1177/16094069241293283

King, E. C., Benson, M., Raysor, S., Holme, T. A., Sewall, J., Koedinger, K. R., Aleven, V., and Yaron, D. J. 2022. The open-response chemistry cognitive assistance tutor system. Journal of Chemical Education, 99, 546–552.

Kirsten, E., Buckmann, A., Mhaidli, A., and Becker, S. 2024. Decoding complexity: Exploring human–AI concordance in qualitative coding. arXiv:2403.06607.

Koopman, B., and Zuccon, G. 2023. Dr ChatGPT tell me what I want to hear: How different prompts impact health answer correctness. In Proceedings of EMNLP 2023 (pp. 15012–15022). ACL.

Lane, H. C., Gadbury, M., Ginger, J., Yi, S., Comins, N., Henhapl, J., and Rivera-Rogers, A. 2022. Triggering STEM interest with Minecraft in a hybrid summer camp. Technology, Mind, and Behavior, 3(4), 580–597.

Linnenbrink-Garcia, L., Durik, A. M., Conley, A. M. M., Barron, K. E., Tauer, J. M., Karabenick, S. A., and Harackiewicz, J. M. 2010. Measuring situational interest in academic domains. Educational and Psychological Measurement, 70, 647–671.

Liu, X., Zhang, J., Barany, A., Pankiewicz, M., and Baker, R. S. 2024. Assessing the potential and limits of large language models in qualitative coding. In Advances in Quantitative Ethnography (pp. 89–103). Springer.

Liu, X., Zambrano, A. F., Baker, R. S., Barany, A., Ocumpaugh, J., Zhang, J., Pankiewicz, M., Nasiar, N., and Wei, Z. 2025. Qualitative coding with GPT-4: Where it works better. Journal of Learning Analytics, 12, 169–185.

López-Fierro, S., Shehzad, U., Zandi, A. S., Clarke-Midura, J., and Recker, M. 2025. Streamlining field note analysis: Leveraging GPT for further insights. Paper presented at the American Educational Research Association (AERA) Annual Meeting, Denver, CO.

McLaren, B. M., Lim, S.-J., Gagnon, F., Yaron, D., and Koedinger, K. R. 2006. Studying the effects of personalized language and worked examples. In Intelligent Tutoring Systems (pp. 318–328). Springer.

Modi, A., Veerubhotla, A. S., Rysbek, A., Huber, A., Wiltshire, B., Veprek, B., Gillick, D., Kasenberg, D., Ahmed, D., Jurenka, I., Cohan, J., She, J., Wilkowski, J., Alarakeyia, K., McKee, K. R., Wang, L., Kunesch, M., Schaeckermann, M., Pîslar, M., … and Assael, Y. 2024. LearnLM: Improving Gemini for learning. CoRR, abs/2412.16429.

Morgan, D. L. 2023. Exploring the use of artificial intelligence for qualitative data analysis. International Journal of Qualitative Methods, 22. https://doi.org/10.1177/16094069231211248

Mu, Y., Wu, B. P., Thorne, W., Robinson, A., Aletras, N., Scarton, C., Bontcheva, K., and Song, X. 2024. Navigating prompt complexity for zero-shot classification. In Proceedings of LREC–COLING 2024 (pp. 12074–12086).

Nguyen, H., Nguyen, V., Ludovise, S., and Santagata, R. 2025. Misrepresentation or inclusion: Promises of generative AI in climate change education. Learning, Media and Technology, 50, 393–409.

Ohmoto, Y., Shimojo, S., Morita, J., and Hayashi, Y. 2024. Estimation of ICAP states based on interaction data. Journal of Educational Data Mining, 16, 149–176.

Panadero, E. 2017. A review of self-regulated learning. Frontiers in Psychology, 8.

Peters, U., and Chin-Yee, B. 2025. Generalization bias in large language model summarization. Royal Society Open Science, 12, 241776.

Ramanathan, S., Lim, L.-A., Mottaghi, N. R., and Buckingham Shum, S. 2025. When the prompt becomes the codebook. In Proceedings of the 15th Learning Analytics and Knowledge Conference (LAK ’25) (pp. 713–725). ACM.

Rebedea, T., Dinu, R., Sreedhar, M. N., Parisien, C., and Cohen, J. 2023. NeMo Guardrails. In Proceedings of EMNLP 2023: System Demonstrations (pp. 431–445). ACL.

Renninger, K. A. 2009. Interest and identity development. Educational Psychologist, 44, 105–118.

Renninger, K. A., and Hidi, S. E. 2020. To level the playing field, develop interest. Policy Insights from the Behavioral and Brain Sciences, 7, 10–18.

Ruijten-Dodoiu, P. 2025. Collaborating with ChatGPT: Iterative thematic analysis. In A. Barany et al. (Eds.), From data to discovery: LLMs for qualitative analysis in education (LAK ’25 Workshop). Dublin, Ireland.

Rupp, A. A., Levy, R., DiCerbo, K. E., Sweet, S. J., Crawford, A. V., Caliço, T., Benson, M., Fay, D., Kunze, K. L., Mislevy, R. J., and Behrens, J. T. 2012. Putting ECD into practice. Journal of Educational Data Mining, 4, 49–110.

Sahoo, P., Singh, A. K., Saha, S., Jain, V., Mondal, S., and Chadha, A. 2024. A systematic survey of prompt engineering. CoRR, abs/2402.07927.

Saldaña, J. 2021. The coding manual for qualitative researchers. Sage.

Schäfer, K., Murray, J., and Tonya, B. 2025. Glows and grows. In A. Barany et al. (Eds.), From data to discovery (LAK ’25 Workshop). Dublin, Ireland.

Shaffer, D. W., and Ruis, A. R. 2021. How we code. In Advances in quantitative ethnography (pp. 62–77). Springer.

Shaffer, D. W., Collier, W., and Ruis, A. R. 2016. A tutorial on epistemic network analysis. Journal of Learning Analytics, 3, 9–45.

Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E. H., Schärli, N., and Zhou, D. 2023. Large language models can be easily distracted. In Proceedings of ICML 2023 (pp. 31210–31227).

Simon, S., Sankaranarayanan, S., Tajik, E., Borchers, C., Shahrokhian, B., Balzan, F., Strauß, S., Viswanathan, S. A., Ataş, A. H., Čarapina, M., Liang, L., and Celik, B. 2025. Comparing a human’s and a multi-agent system’s thematic analysis. In Artificial Intelligence in Education (pp. 60–73). Springer.

Strauss, A. L. 1987. Qualitative analysis for social scientists. Cambridge University Press.

Tai, Y. C., Patni, K. N., Hemauer, N., Desmarais, B., and Lin, Y.-R. 2025. GenAI vs. human fact-checkers. In Proceedings of the 17th ACM Web Science Conference (WebSci ’25) (pp. 516–521). ACM.

Wang, Y., Song, W., Tao, W., Liotta, A., Yang, D., Li, X., Gao, S., Sun, Y., Ge, W., and Zhang, W. 2022. A systematic review on affective computing. Information Fusion, 83, 19–52.

Wei, Z., Nasiar, N., Zambrano, A. F., Liu, X., Ocumpaugh, J., Barany, A., Baker, R. S., and Giordano, C. 2025. Exploring students’ interest-driven patterns. In Proceedings of the 19th International Conference of the Learning Sciences (ICLS ’25) (pp. 386–394).

Weston, C., Gandell, T., Beauchamp, J., McAlpine, L., Wiseman, C., and Beauchamp, C. 2001. Analyzing interview data. Qualitative Sociology, 24, 381–400.

Winne, P. H., and Hadwin, A. F. 1998. Studying as self-regulated learning. In Metacognition in educational theory and practice (pp. 277–304).

Zambrano, A. F., Liu, X., Barany, A., Baker, R. S., Kim, J., and Nasiar, N. 2023. From nCoder to ChatGPT. In International Conference on Quantitative Ethnography (pp. 470–485). Springer.

Zhang, J., Borchers, C., and Barany, A. 2024a. Studying the interplay of self-regulated learning cycles. In Advances in Quantitative Ethnography (pp. 231–246). Springer.

Zhang, J., Borchers, C., Aleven, V., and Baker, R. S. 2024b. Using large language models to detect self-regulated learning. In Proceedings of the 17th International Conference on Educational Data Mining (pp. 157–168).

Zhang, J., Andres, J. M. A. L., Hutt, S., Baker, R. S., Ocumpaugh, J., Mills, C., Brooks, J., Sethuraman, S., and Young, T. 2022. Detecting SMART model cognitive operations. In Proceedings of the International Conference on Educational Data Mining (pp. 75–85).

Zhou, Y., and Paquette, L. 2024. Investigating student interest in a Minecraft environment. In Proceedings of the 17th International Conference on Educational Data Mining (pp. 396–404).

Zimmerman, B. J. 2000. Attaining self-regulation. In Handbook of self-regulation (pp. 13–39). Academic Press.
Section
Special Section: Human-AI Partnership for Qualitative Analysis