Comparing Zero-Shot Large Language Model Prompting with Human Coding of Theory Concepts in Student Essays


Published April 6, 2026
Shelley Keith, Philip I. Pavlik, Jr., Kristen L. Stives, Laura Jean Kerr

Abstract

Recent studies have explored the cost and time benefits of using artificial intelligence (AI), particularly large language models (LLMs), to code student essays. While these models show promise, little is understood about the factors that determine how their qualitative coding performance compares to human coding. This study examines coding accuracy for content errors in college student essays on criminological theories by comparing human-coded results with outputs from four LLMs. We evaluated human-AI correlations, AI error, and AI bias across four LLMs, five prompt types, three theory-content coding dimensions, and four criminological theories. Results indicate that LLM choice significantly influenced human-AI correspondence, with Claude Sonnet 4 exhibiting the best overall performance and GPT-4.1 Mini the worst. Prompt type had minimal impact on performance. Across models, error rates were lowest when identifying whether students listed a concept and highest when assessing whether definitions were correct. LLMs performed better on concise theories than on more complex ones. The code is available at https://github.com/imrryr/LLM-queries.
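The three comparison metrics named in the abstract (correlation, error, bias) can be sketched for the simplest case of binary codes, such as whether an essay lists a concept. This is a minimal illustration, not the paper's actual analysis: the specific definitions used here (phi/Pearson correlation of the two code vectors, raw disagreement rate as error, and the difference in code means as bias) are assumptions for this sketch, as are the function and variable names.

```python
from math import sqrt
from statistics import mean

def agreement_metrics(human, ai):
    """Compare paired human and AI binary codes (0/1 per essay).

    Returns (correlation, error_rate, bias):
      - correlation: Pearson (phi) correlation between the two code vectors
      - error_rate:  fraction of essays where the AI disagrees with the human coder
      - bias:        mean(ai) - mean(human); positive means the AI applies
                     the code more often than the human did
    """
    mh, ma = mean(human), mean(ai)
    cov = sum((h - mh) * (a - ma) for h, a in zip(human, ai))
    var_h = sum((h - mh) ** 2 for h in human)
    var_a = sum((a - ma) ** 2 for a in ai)
    corr = cov / sqrt(var_h * var_a) if var_h and var_a else float("nan")
    error_rate = mean(h != a for h, a in zip(human, ai))
    bias = ma - mh
    return corr, error_rate, bias

# Hypothetical codes for eight essays: did the student list the concept?
human_codes = [1, 1, 0, 0, 1, 0, 1, 0]
ai_codes    = [1, 0, 0, 0, 1, 0, 1, 1]
corr, err, bias = agreement_metrics(human_codes, ai_codes)
# corr = 0.5, err = 0.25, bias = 0.0 for this toy data
```

Note that bias can be zero even when the error rate is not: here the AI over-codes one essay and under-codes another, so its overall coding rate matches the human's despite two disagreements.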

How to Cite

Keith, S., Pavlik, P. I., Jr., Stives, K. L., & Kerr, L. J. (2026). Comparing zero-shot large language model prompting with human coding of theory concepts in student essays. Journal of Educational Data Mining, 18(1), 286-317. https://doi.org/10.5281/zenodo.19443160


Keywords

large language models, automated essay scoring, prompt engineering, student writing, criminological theory, qualitative coding

Section
Special Section: Human-AI Partnership for Qualitative Analysis