Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback



Published Jun 27, 2024
Jacob Whitehill, Jennifer LoCasale-Crouch


With the aim of providing teachers with more specific, frequent, and actionable feedback about their teaching,
we explore how Large Language Models (LLMs) can be used to estimate “Instructional Support”
domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol.
We design a machine learning architecture that uses zero-shot prompting of Meta’s Llama2
and/or a classic Bag of Words (BoW) model to classify individual utterances of teachers’ speech (transcribed
automatically using OpenAI’s Whisper) for the presence of Instructional Support. Then, these
utterance-level judgments are aggregated over a 15-minute observation session to estimate a global CLASS
score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate
that (1) automatic CLASS Instructional Support estimation accuracy using the proposed method (Pearson
R up to 0.48) approaches human inter-rater reliability (up to R = 0.55); (2) LLMs generally yield slightly
greater accuracy than BoW for this task, though the best models often combined features extracted from
both LLM and BoW; and (3) for classifying individual utterances, there is still room for improvement
of automated methods compared to human-level judgments. Finally, (4) we illustrate how the model’s
outputs can be visualized at the utterance level to provide teachers with explainable feedback on which
utterances were most positively or negatively correlated with specific CLASS dimensions.
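
The two-stage idea described above (score each utterance, then aggregate over a session and compare against human scores with Pearson's R) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the cue-word list `SUPPORT_CUES`, the toy scorer `utterance_score`, and the use of a plain mean in `session_score` are hypothetical stand-ins, not the classifiers or the aggregation method used in the paper, which the abstract does not specify.

```python
from math import sqrt

# Hypothetical cue words standing in for a trained BoW or LLM classifier.
SUPPORT_CUES = {"why", "how", "think", "because", "what"}


def utterance_score(utterance: str) -> float:
    """Toy utterance-level judgment: fraction of tokens that are cue words."""
    tokens = [t.strip(".,?!").lower() for t in utterance.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in SUPPORT_CUES for t in tokens) / len(tokens)


def session_score(utterances: list[str]) -> float:
    """Aggregate utterance-level scores over one session.

    Averaging is an assumption made for this sketch only.
    """
    if not utterances:
        return 0.0
    return sum(utterance_score(u) for u in utterances) / len(utterances)


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between predicted and human-coded session scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Made-up utterances, for illustration only.
session = ["Why do you think that?", "Sit down."]
predicted = session_score(session)
```

Because each session score is built from per-utterance scores, the same per-utterance values can be shown back to a teacher to indicate which utterances contributed most, which is the kind of utterance-level feedback the abstract describes.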

How to Cite

Whitehill, J., & LoCasale-Crouch, J. (2024). Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback. Journal of Educational Data Mining, 16(1), 34–60.



Keywords: classroom observation, teacher feedback, machine learning, natural language processing, large language models

EDM 2024 Journal Track