Automating Self-Affirmation Essay Coding: Fine-Tuned BERT Performance Comparable to Human Coders and Comparison with GPT-4
Abstract
Previous studies have demonstrated that a self-affirmation writing intervention, in which students reflect on personally important values, positively impacts students' school performance, and the intervention remains an area of active research. However, this research requires manual coding of students' writing exercises, which has proved to be a time-consuming and expensive undertaking. To assist future self-affirmation intervention studies and educators implementing the writing exercise, we used our labeled data to fine-tune a pre-trained language model that achieves a level of performance comparable to that of human coders (Cohen's kappa of 0.85 between machine coding and human coders, compared to 0.83 between human coders). To explore the potential of more advanced language models without requiring a large training dataset, we also evaluated OpenAI's GPT-4 in zero-shot and few-shot classification settings. GPT-4's zero-shot predictions yield reasonable accuracy but do not reach the fine-tuned BERT model's performance or human-level agreement, and adding example essays (few-shot prompting) did not appreciably improve GPT-4's results. Our analysis also finds that the BERT model's performance is consistent across student subgroups, with minimal disparity between "stereotype-threatened" and "non-threatened" students, the focal comparison groups in the self-affirmation intervention. We further demonstrate the generalizability of the fine-tuned model on an external dataset collected by a different research team: the model maintained high agreement with human coders (Cohen's kappa = 0.86) on this new sample. These results suggest that a fine-tuned transformer model can reliably code self-affirmation essays, thereby reducing the coding burden for future researchers and educators. To help the research community automate this burdensome coding task, we make the fine-tuned model publicly available at https://github.com/visortown/bert-self-affirm.
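The coding task described above is a standard supervised text classification problem, and the reported agreement statistic is Cohen's kappa, which corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below illustrates the general fine-tune-then-measure-agreement workflow with Hugging Face Transformers and scikit-learn. It is a minimal sketch under stated assumptions: the bert-base-uncased checkpoint, the binary "affirmed / not affirmed" label scheme, the toy essays, and all hyperparameters are illustrative stand-ins, not the authors' exact pipeline.

# Minimal sketch: fine-tune a BERT classifier on human-coded essays,
# then check machine-human agreement with Cohen's kappa.
# Assumptions (not from the paper): bert-base-uncased, binary labels,
# toy data, and illustrative hyperparameters.
from datasets import Dataset
from sklearn.metrics import cohen_kappa_score
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

# Toy stand-ins for human-coded essays; real data would be loaded here.
train = Dataset.from_dict({
    "text": ["My family matters most to me because ...",
             "I wrote about what I did over the weekend ..."],
    "label": [1, 0],  # 1 = affirmed, 0 = not affirmed (hypothetical codes)
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

train = train.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-self-affirm-out",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()

# Agreement between machine codes and a human coder; in practice this
# would be computed on held-out essays, not the training set.
preds = trainer.predict(train).predictions.argmax(axis=-1)
print(cohen_kappa_score(train["label"], preds))

In practice, a kappa like the 0.85 reported above would be computed on essays the model never saw during fine-tuning, with the human codes as the reference labels.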
BERT fine-tuning, GPT-4, text classification, self-affirmation intervention, student writing coding, automated essay classification
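The zero-shot and few-shot GPT-4 evaluation mentioned in the abstract amounts to prompt-based classification. The snippet below is a hedged sketch using the OpenAI Python client; the system prompt, the "affirmed / not_affirmed" label vocabulary, and the few-shot example are hypothetical stand-ins, not the prompts used in the study.

# Hedged sketch of zero-/few-shot GPT-4 essay coding with the OpenAI
# Python client (openai>=1.0). Prompt wording and labels are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = ("You are coding self-affirmation essays. Answer with exactly "
          "one word: 'affirmed' if the student reflects on a personally "
          "important value, otherwise 'not_affirmed'.")

# Optional few-shot examples; pass an empty sequence for zero-shot.
FEW_SHOT = [
    {"role": "user", "content": "Essay: My friends keep me grounded ..."},
    {"role": "assistant", "content": "affirmed"},
]

def code_essay(essay: str, few_shot=()) -> str:
    messages = [{"role": "system", "content": SYSTEM}, *few_shot,
                {"role": "user", "content": f"Essay: {essay}"}]
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0)
    return response.choices[0].message.content.strip()

print(code_essay("I care most about my family because ..."))            # zero-shot
print(code_essay("I care most about my family because ...", FEW_SHOT))  # few-shot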
\def\MetaTitle{Automating Self-Affirmation Essay Coding: Fine-Tuned BERT Performance Comparable to Human Coders and Comparison with GPT-4}
\def\MetaAuthorXMP{Cong Ye \sep Trisha H. Borman \sep Geoffrey D. Borman}
\def\MetaAuthorInfo{Cong Ye; Trisha H. Borman; Geoffrey D. Borman}
\def\MetaSubject{Previous studies have demonstrated that a self-affirmation writing intervention, in which students reflect on personally important values, positively impacts students' school performance, and the intervention remains an area of active research. However, this research requires manual coding of students' writing exercises, which has proved to be a time-consuming and expensive undertaking. To assist future self-affirmation intervention studies and educators implementing the writing exercise, we used our labeled data to fine-tune a pre-trained language model that achieves a level of performance comparable to that of human coders (Cohen's kappa of 0.85 between machine coding and human coders, compared to 0.83 between human coders). To explore the potential of more advanced language models without requiring a large training dataset, we also evaluated OpenAI's GPT-4 in zero-shot and few-shot classification settings. GPT-4's zero-shot predictions yield reasonable accuracy but do not reach the fine-tuned BERT model's performance or human-level agreement, and adding example essays (few-shot prompting) did not appreciably improve GPT-4's results. Our analysis also finds that the BERT model's performance is consistent across student subgroups, with minimal disparity between "stereotype-threatened" and "non-threatened" students, the focal comparison groups in the self-affirmation intervention. We further demonstrate the generalizability of the fine-tuned model on an external dataset collected by a different research team: the model maintained high agreement with human coders (Cohen's kappa = 0.86) on this new sample. These results suggest that a fine-tuned transformer model can reliably code self-affirmation essays, thereby reducing the coding burden for future researchers and educators. To help the research community automate this burdensome coding task, we make the fine-tuned model publicly available at https://github.com/visortown/bert-self-affirm.}
\def\MetaKeywordsXMP{BERT fine-tuning \sep GPT-4 \sep text classification \sep self-affirmation intervention \sep student writing coding \sep automated essay classification}
\def\MetaKeywordsInfo{BERT fine-tuning, GPT-4, text classification, self-affirmation intervention, student writing coding, automated essay classification}
\begin{filecontents*}[overwrite]{\jobname.xmpdata}
% ADD METADATA BELOW: TITLE/AUTHORS/ABSTRACT/KEYWORDS
\Title{\MetaTitle}
\Author{\MetaAuthorXMP}
\Subject{\MetaSubject}
\Keywords{\MetaKeywordsXMP}
\end{filecontents*}
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{colorprofiles}
\usepackage[a-2b,mathxmp]{pdfx}[2018/12/22]
\usepackage{pdfpages}
\usepackage{pax}
%\iffalse
%% \usepackage{hyperref} %commenting b/c pdfx loads already
% \hypersetup{
% filebordercolor={1 1 0},
% pdfa,
% pdfstartview=,
% }
%\fi
\hypersetup{
pdftitle={\MetaTitle},
pdfauthor={\MetaAuthorInfo},
pdfsubject={\MetaSubject},
pdfkeywords={\MetaKeywordsInfo},
colorlinks = true, %Colours links instead of ugly boxes
urlcolor = blue, %Colour for external hyperlinks
linkcolor = blue, %Colour of internal links
citecolor = blue, %Colour of citations
}
\pdfinfo{
/Author (\MetaAuthorInfo)
/Keywords (\MetaKeywordsInfo)
}
%\usepackage[usenames]{xcolor}
%\usepackage[bottom=1cm,right=3.25cm,left=2.5cm,top=1cm]{geometry}
\usepackage[bottom=2cm,right=1in,left=1in,top=1cm]{geometry}
\usepackage{fancyhdr} %started above geometry load, moved below
%necessary for pax to insert /F Print code into pdf; will not validate as pdfa otherwise
%see https://github.com/ho-tex/oberdiek/issues/86
\makeatletter
\def\PAX@attrs@URI#1{/F 4 #1}
\makeatother
% ROMAN NUMERALS FOR ACKNOWLEDGMENTS; COMMENT OUT THIS LINE FOR ARTICLES
%\renewcommand{\thepage}{\roman{page}}
\def\pdffilename{submission.pdf}
% CHANGE THE VOLUME/ISSUE/YEAR AND STARTING PAGE BELOW
\def\artvolume{18}
\def\artno{1}
\def\artmonthyear{2026}
\def\startpage{66} % page numbering based on whole volumes
%\def\showcolor{white}
\setcounter{page}{\startpage}
\pagestyle{fancy}
\rhead{~}
\chead{~}%\colorbox{white}{\raisebox{0pt}[48pt][24pt]{\parbox{\textwidth}{~}}}}
\cfoot{\colorbox{white}{\raisebox{0pt}[24pt][24pt]{\parbox{.8\textwidth}{~}}}} %original file had ~ as _, causing error
\rfoot{\colorbox{white}{\raisebox{0pt}[24pt][24pt]{Journal of Educational Data Mining, Volume \artvolume, No \artno, \artmonthyear}}}
\lfoot{\colorbox{white}{\raisebox{0pt}[24pt][24pt]{\parbox{.8\textwidth}{\thepage}}}}
\renewcommand{\headrulewidth}{0pt}
%IF WE CORRECT THIS ERROR, THE ORIGINAL PAGE # REAPPEARS ABOVE THE NEW FOOTER
%\setlength{\headheight}{72pt} %https://tex.stackexchange.com/questions/327285/what-does-this-warning-mean-fancyhdr-and-headheight
\begin{document}
%\includepdf[pages=-,rotateoversize=true,noautoscale=true,pagecommand={\thispagestyle{fancy}}]{\pdffilename}
%\includepdf[pages={-},pagecommand={\thispagestyle{fancy}}]{\pdffilename} %blows up, Missing $ inserted
%\includepdf[pages={-},pagecommand={\thispagestyle{fancy}},fitpaper]{\pdffilename}
\includepdf[pages=-,rotateoversize=false,noautoscale=true,pagecommand={\thispagestyle{fancy}}]{\pdffilename}
\end{document}

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term “Work” shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attribution—other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercial—other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Works—other users (including Publisher) may not alter, transform, or build upon this Work, with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
- the Work is the Author’s original work;
- the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
- the Work is not pending review or under consideration by another publisher;
- the Work has not previously been published;
- the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
- the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.