Comparing Zero-Shot Large Language Model Prompting with Human Coding of Theory Concepts in Student Essays
Abstract
Recent studies have explored the cost and time benefits of using artificial intelligence (AI), particularly large language models (LLMs), in coding student essays. While these models show promise, not enough is understood about the factors that affect how their qualitative coding performance compares to human coding. This study examines coding accuracy for content errors in college student essays on criminological theories by comparing human-coded results with outputs from four LLMs. We evaluated human-AI correlations, AI error, and AI bias across four LLMs, five prompt types, three theory content coding dimensions, and four criminological theories. Results indicate that LLM choice significantly influenced human-AI correspondence, with Claude Sonnet 4 exhibiting the best overall performance and GPT 4.1 Mini the worst. Prompt type had minimal impact on performance. Across models, error rates were lowest when identifying whether students listed a concept, and highest when assessing whether definitions were correct. LLMs performed better on concise theories than on more complex ones. The code is available at https://github.com/imrryr/LLM-queries
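The abstract summarizes human-AI correspondence using three quantities: correlation between human and LLM codes, error rate, and bias. As a minimal illustration of what such summaries look like (the function name and the toy data below are hypothetical, not taken from the study), the three measures can be sketched as:

```python
from statistics import mean

def agreement_summary(human, llm):
    """Return (pearson_r, error_rate, bias) for two equal-length lists of codes.

    error_rate is the share of items where the codes disagree;
    bias is the mean signed difference (LLM minus human).
    """
    assert len(human) == len(llm) and len(human) > 0
    n = len(human)
    mh, ml = mean(human), mean(llm)
    # Population covariance and standard deviations for Pearson's r.
    cov = sum((h - mh) * (l - ml) for h, l in zip(human, llm)) / n
    sh = (sum((h - mh) ** 2 for h in human) / n) ** 0.5
    sl = (sum((l - ml) ** 2 for l in llm) / n) ** 0.5
    r = cov / (sh * sl) if sh and sl else float("nan")
    error_rate = sum(h != l for h, l in zip(human, llm)) / n
    bias = ml - mh
    return r, error_rate, bias

# Toy binary codes for one coding dimension (e.g., "did the student
# list the concept?"): 1 = yes, 0 = no.
human_codes = [1, 1, 0, 1, 0, 1, 1, 0]
llm_codes   = [1, 1, 0, 0, 0, 1, 1, 1]
r, err, bias = agreement_summary(human_codes, llm_codes)
```

For these toy codes the two raters disagree on two of eight items, giving an error rate of 0.25 and (since each rater codes "yes" five times) a bias of zero; in the study, analogous summaries would be computed per model, prompt type, coding dimension, and theory.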
Keywords: large language models, automated essay scoring, prompt engineering, student writing, criminological theory, qualitative coding

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term “Work” shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attribution—other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercial—other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Works—other users (including Publisher) may not alter, transform, or build upon this Work, with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
- the Work is the Author’s original work;
- the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
- the Work is not pending review or under consideration by another publisher;
- the Work has not previously been published;
- the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
- the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.