Multi-Dimensional Performance Analysis of Large Language Models for Classroom Discussion Assessment

Published Dec 23, 2024
Nhat Tran, Benjamin Pierce, Diane Litman, Richard Correnti, Lindsay Clare Matsumura

Abstract

Automatic scoring of classroom discussion quality is becoming increasingly feasible with the help of advances in natural language processing such as large language models (LLMs). Whether scores produced by LLMs can be used to make valid inferences about discussion quality at scale, however, remains unclear. In this work, we examine how the assessment performance of two LLMs interacts with three factors that may affect performance: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the two LLMs. Our results suggest that all three factors affect the performance of the tested LLMs and that consistency is related to performance. Using these results in conjunction with data from a randomized controlled trial, we then examine whether LLM-based assessment approaches that strike a practical balance of predictive performance, computational efficiency, and consistency can be used to identify growth in discussion quality. We find that the best-performing LLM methods partially replicate results derived from human scores.
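To make the setup described in the abstract concrete, the following minimal sketch (in Python) shows how the three factors under study might be operationalized: task formulation via a rubric prompt, context length via transcript truncation, and few-shot examples as in-context score demonstrations, plus a simple majority-vote probe of predictive consistency. This is an illustration under assumptions, not the authors' implementation: the query_llm callable, the 1-4 rubric scale, and all helper names are hypothetical.

# Illustrative sketch only (not the paper's code). Assumes a generic
# query_llm callable that sends a prompt string to some LLM and returns text.
from collections import Counter
from typing import Callable, Sequence, Tuple

def build_prompt(
    transcript_turns: Sequence[str],       # one string per speaker turn
    few_shot: Sequence[Tuple[str, int]],   # (excerpt, human score) pairs
    context_turns: int,                    # context-length factor
    rubric: str,                           # task-formulation factor
) -> str:
    """Assemble a scoring prompt from a rubric, examples, and truncated context."""
    parts = [f"Rate the classroom discussion on a 1-4 scale.\nRubric: {rubric}\n"]
    for excerpt, score in few_shot:        # few-shot-examples factor
        parts.append(f"Example discussion:\n{excerpt}\nScore: {score}\n")
    # Keep only the last `context_turns` turns of the transcript.
    context = "\n".join(transcript_turns[-context_turns:])
    parts.append(f"Discussion to rate:\n{context}\nScore:")
    return "\n".join(parts)

def majority_score(prompt: str, query_llm: Callable[[str], str], n: int = 5) -> int:
    """Query the model n times and majority-vote: a crude consistency probe."""
    votes = [int(query_llm(prompt).strip()[0]) for _ in range(n)]
    return Counter(votes).most_common(1)[0][0]

Repeated sampling with a majority vote is one simple way to surface the consistency dimension the paper analyzes: a model whose repeated scores for the same transcript rarely agree is unlikely to support valid inferences at scale.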

How to Cite

Tran, N., Pierce, B., Litman, D., Correnti, R., & Matsumura, L. C. (2024). Multi-Dimensional Performance Analysis of Large Language Models for Classroom Discussion Assessment. Journal of Educational Data Mining, 16(2), 304–335. https://doi.org/10.5281/zenodo.14549071

Keywords

classroom discussion, large language models, scoring, reliability

Section
Extended Articles from the EDM 2024 Conference