Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback
Abstract
With the aim of providing teachers with more specific, frequent, and actionable feedback about their teaching,
we explore how Large Language Models (LLMs) can be used to estimate “Instructional Support”
domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol.
We design a machine learning architecture that uses zero-shot prompting of Meta’s Llama2
and/or a classic Bag of Words (BoW) model to classify individual utterances of teachers’ speech (transcribed
automatically using OpenAI’s Whisper) for the presence of Instructional Support. Then, these
utterance-level judgments are aggregated over a 15-min observation session to estimate a global CLASS
score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate
that (1) automatic CLASS Instructional Support estimation accuracy using the proposed method (Pearson
R up to 0.48) approaches human inter-rater reliability (up to R = 0.55); (2) LLMs generally yield slightly
greater accuracy than BoW for this task, though the best models often combined features extracted from
both LLM and BoW; and (3) for classifying individual utterances, there is still room for improvement
of automated methods compared to human-level judgments. Finally, (4) we illustrate how the model’s
outputs can be visualized at the utterance level to provide teachers with explainable feedback on which
utterances were most positively or negatively correlated with specific CLASS dimensions.
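The two-stage idea described above (utterance-level judgments, then aggregation to a session-level estimate that is compared against human scores with Pearson's R) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the aggregation function or any downstream regressor, so the simple "fraction of flagged utterances" feature, the function names `session_features` and `pearson_r`, and all example values are hypothetical.

```python
from math import sqrt
from statistics import mean


def session_features(utterance_flags):
    """Aggregate per-utterance binary judgments into session-level features.

    utterance_flags: list of 0/1 values, one per teacher utterance in a
    session (1 = classifier judged Instructional Support to be present).
    The count/fraction features are an assumed, simplified aggregation.
    """
    n = len(utterance_flags)
    count = sum(utterance_flags)
    return {"count": count, "fraction": count / n if n else 0.0}


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical data: utterance-level flags for four sessions and
    # made-up human scores for the same sessions (illustrative only).
    sessions = [
        [0, 0, 1, 0, 0],
        [1, 0, 1, 0, 1],
        [1, 1, 1, 0, 1],
        [0, 0, 0, 0, 0],
    ]
    human_scores = [2.0, 3.5, 4.0, 1.5]

    estimates = [session_features(s)["fraction"] for s in sessions]
    print("session-level features:", estimates)
    print("Pearson R vs. human scores:", pearson_r(estimates, human_scores))
```

In practice the session-level features would more plausibly feed a trained regressor rather than be correlated with human scores directly; the direct correlation here only keeps the sketch short.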
Keywords: classroom observation, teacher feedback, machine learning, natural language processing, large language models
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.