Automating Self-Affirmation Essay Coding: Fine-Tuned BERT Performance Comparable to Human Coders and Comparison with GPT-4


Published January 30, 2026
Cong Ye, Trisha H. Borman, Geoffrey D. Borman

Abstract

Previous studies have demonstrated that a self-affirmation writing intervention, in which students reflect on personally important values, positively impacts students' school performance, and the intervention remains an active area of research. This research, however, requires manual coding of students' writing exercises, which has proved time-consuming and expensive. To assist future self-affirmation intervention studies and educators implementing the writing exercise, we used our labeled data to fine-tune a pre-trained language model that achieves performance comparable to that of human coders (Cohen's Kappa of 0.85 between machine coding and human coders, compared to 0.83 between human coders). To explore the potential of more advanced language models without requiring a large training dataset, we also evaluated OpenAI's GPT-4 in zero-shot and few-shot classification settings. GPT-4's zero-shot predictions yield reasonable accuracy but do not reach the fine-tuned BERT model's performance or human-level agreement, and adding example essays (few-shot prompting) did not appreciably improve GPT-4's results. Our analysis also finds that the BERT model's performance is consistent across student subgroups, with minimal disparity between "stereotype-threatened" and "non-threatened" students, the focal groups for comparison in the self-affirmation intervention. We further demonstrate the generalizability of the fine-tuned model on an external dataset collected by a different research team: the model maintained high agreement with human coders (Cohen's Kappa = 0.86) on this new sample. These results suggest that a fine-tuned transformer model can reliably code self-affirmation essays, thereby reducing the coding burden for future researchers and educators. We make the fine-tuned model publicly available at https://github.com/visortown/bert-self-affirm to help the research community automate this burdensome coding task.
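The sketch below illustrates the kind of pipeline the abstract describes: fine-tuning a BERT checkpoint as a binary essay classifier and checking machine-human agreement with Cohen's Kappa. It is a minimal illustration built on the Hugging Face transformers and scikit-learn APIs, assuming a bert-base-uncased starting checkpoint, a 0/1 coding scheme, and toy placeholder essays; it is not the authors' exact training setup or the released model linked above.

import torch
from torch.utils.data import Dataset
from sklearn.metrics import cohen_kappa_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy placeholder essays standing in for the hand-coded training data;
# 1 = essay reflects on a personally important value, 0 = it does not.
train_texts = ["My family matters most to me because they support me.",
               "I did not pick any of the values on the list."]
train_labels = [1, 0]

class EssayDataset(Dataset):
    """Wraps tokenized essays and their human-assigned binary codes."""
    def __init__(self, texts, labels, tokenizer, max_len=256):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def kappa_metric(eval_pred):
    # Agreement between the model's predicted codes and the human codes,
    # computed with the same statistic reported for human-human reliability.
    preds = eval_pred.predictions.argmax(axis=-1)
    return {"cohen_kappa": cohen_kappa_score(eval_pred.label_ids, preds)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

train_ds = EssayDataset(train_texts, train_labels, tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-self-affirm-sketch",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
    eval_dataset=train_ds,  # a real study would hold out coded essays here
    compute_metrics=kappa_metric,
)
trainer.train()
print(trainer.evaluate())  # reports eval_cohen_kappa for machine-human agreement

In practice the held-out evaluation set, rather than the training essays, would be used to estimate the machine-human Kappa values such as those reported in the abstract.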

How to Cite

Ye, C., Borman, T. H., & Borman, G. D. (2026). Automating Self-Affirmation Essay Coding: Fine-Tuned BERT Performance Comparable to Human Coders and Comparison with GPT-4. Journal of Educational Data Mining, 18(1), 66–88. https://doi.org/10.5281/zenodo.18435844

Details

Keywords

BERT fine-tuning, GPT-4, text classification, self-affirmation intervention, student writing coding, automated essay classification


Section
Special Section: Human-AI Partnership for Qualitative Analysis