Optimizing Speaker Diarization for the Classroom: Applications in Timing Student Speech and Distinguishing Teachers from Children


Published Feb 14, 2025
Jiani Wang, Shiran Dudy, Xinlu He, Zhiyong Wang, Rosy Southwell, Jacob Whitehill

Abstract

An important dimension of classroom group dynamics and collaboration is how much each person contributes to the discussion. With the goal of distinguishing teachers' speech from children's speech and measuring how much each student speaks, we investigated how automatic speaker diarization can be built to handle real-world classroom group discussions. We examined key design considerations, including the granularity of speaker assignment, speech enhancement techniques, voice activity detection, and embedding assignment methods, to find an effective configuration. The best speaker diarization system we found was based on the ECAPA-TDNN speaker embedding model and used Whisper automatic speech recognition to identify speech segments. The diarization error rate (DER) on challenging, noisy, spontaneous classroom data was around 34%, and the correlation between estimated and human-annotated speaking time per student reached 0.62. The accuracy of distinguishing teachers' speech from children's speech was 69.17%. We evaluated the system for potential accuracy bias across people of different skin tones and genders and found no statistically significant differences in accuracy across either dimension. Thus, the presented diarization system has the potential to benefit educational research and to provide teachers and students with useful feedback to better understand their classroom dynamics.
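As an illustration of the speaking-time evaluation mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes diarization output is available as (speaker, start, end) segments in seconds and that system speaker labels have already been matched to the annotated students; the function names and all numbers are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def speaking_time(segments):
    """Sum speech duration (seconds) per speaker from (speaker, start, end) tuples."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += max(0.0, end - start)  # ignore malformed negative spans
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical diarization output for one group discussion (made-up values).
predicted_segments = [
    ("S1", 0.0, 4.0),
    ("S2", 4.0, 6.0),
    ("S3", 6.0, 10.0),
    ("S1", 10.0, 13.0),
]
# Hypothetical human-annotated speaking time per student, in seconds.
annotated_totals = {"S1": 8.0, "S2": 1.5, "S3": 5.0}

estimated_totals = speaking_time(predicted_segments)
students = sorted(annotated_totals)
r = pearson(
    [estimated_totals.get(s, 0.0) for s in students],
    [annotated_totals[s] for s in students],
)
print(f"Correlation of estimated vs. annotated speaking time: {r:.2f}")
```

In practice, diarization systems emit arbitrary cluster labels, so a label-matching step between system output and annotations would be needed before a comparison like this one.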

How to Cite

Wang, J., Dudy, S., He, X., Wang, Z., Southwell, R., & Whitehill, J. (2025). Optimizing Speaker Diarization for the Classroom: Applications in Timing Student Speech and Distinguishing Teachers from Children. Journal of Educational Data Mining, 17(1), 98–125. https://doi.org/10.5281/zenodo.14871875


Keywords

automatic speech recognition, automatic classroom analysis, group collaboration, speaker diarization

Section
Extended Articles from the EDM 2024 Conference