Multi-Dimensional Performance Analysis of Large Language Models for Classroom Discussion Assessment

Published Dec 23, 2024
Nhat Tran, Benjamin Pierce, Diane Litman, Richard Correnti, Lindsay Clare Matsumura

Abstract

Automatic scoring of classroom discussion quality is becoming increasingly feasible with the help of advances in natural language processing such as large language models (LLMs). Whether scores produced by LLMs can be used to make valid inferences about discussion quality at scale, however, remains unclear. In this work, we examine how the assessment performance of two LLMs interacts with three factors that may affect performance: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the two LLMs. Our results suggest that all three factors affect the performance of the tested LLMs and that consistency is related to performance. Using these results in conjunction with data from a randomized controlled trial, we then examine whether LLM-based assessment approaches that strike a practical balance of predictive performance, computational efficiency, and consistency can be used to identify growth in discussion quality. We find that the best-performing LLM methods partially replicate results derived from human scores.
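To make the setup described in the abstract concrete, the following minimal sketch (in Python) shows how the three factors under study might be operationalized: task formulation via a rubric prompt, context length via transcript truncation, and few-shot examples as in-context score demonstrations, plus a simple majority-vote probe of predictive consistency. This is an illustration under assumptions, not the authors' implementation: the query_llm callable, the 1-4 rubric scale, and all helper names are hypothetical.

# Illustrative sketch only (not the paper's code). Assumes a generic
# query_llm callable that sends a prompt string to some LLM and returns text.
from collections import Counter
from typing import Callable, Sequence, Tuple

def build_prompt(
    transcript_turns: Sequence[str],       # one string per speaker turn
    few_shot: Sequence[Tuple[str, int]],   # (excerpt, human score) pairs
    context_turns: int,                    # context-length factor
    rubric: str,                           # task-formulation factor
) -> str:
    """Assemble a scoring prompt from a rubric, examples, and truncated context."""
    parts = [f"Rate the classroom discussion on a 1-4 scale.\nRubric: {rubric}\n"]
    for excerpt, score in few_shot:        # few-shot-examples factor
        parts.append(f"Example discussion:\n{excerpt}\nScore: {score}\n")
    # Keep only the last `context_turns` turns of the transcript.
    context = "\n".join(transcript_turns[-context_turns:])
    parts.append(f"Discussion to rate:\n{context}\nScore:")
    return "\n".join(parts)

def majority_score(prompt: str, query_llm: Callable[[str], str], n: int = 5) -> int:
    """Query the model n times and majority-vote: a crude consistency probe."""
    votes = [int(query_llm(prompt).strip()[0]) for _ in range(n)]
    return Counter(votes).most_common(1)[0][0]

Repeated sampling with a majority vote is one simple way to surface the consistency dimension the paper analyzes: a model whose repeated scores for the same transcript rarely agree is unlikely to support valid inferences at scale.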

How to Cite

Tran, N., Pierce, B., Litman, D., Correnti, R., & Matsumura, L. C. (2024). Multi-Dimensional Performance Analysis of Large Language Models for Classroom Discussion Assessment. Journal of Educational Data Mining, 16(2), 304–335. https://doi.org/10.5281/zenodo.14549071

Keywords

classroom discussion, large language models, scoring, reliability

Section
Extended Articles from the EDM 2024 Conference