Optimizing Speaker Diarization for the Classroom: Applications in Timing Student Speech and Distinguishing Teachers from Children
Abstract
An important dimension of classroom group dynamics and collaboration is how much each person contributes to the discussion. With the goal of distinguishing teachers' speech from children's speech and measuring how much each student speaks, we investigated how automatic speaker diarization can be built to handle real-world classroom group discussions. We examined key design considerations, including the granularity of speaker assignment, speech enhancement techniques, voice activity detection, and embedding assignment methods, to find an effective configuration. The best speaker diarization system we found was based on the ECAPA-TDNN speaker embedding model and used Whisper automatic speech recognition to identify speech segments. On challenging, noisy, spontaneous classroom data, the diarization error rate (DER) was around 34%, and the correlation between estimated and human-annotated amounts of speech per student reached 0.62. The accuracy of distinguishing teachers' speech from children's speech was 69.17%. We evaluated the system for potential accuracy bias across people of different skin tones and genders and found no statistically significant differences in accuracy along either dimension. Thus, the presented diarization system has the potential to benefit educational research and to provide teachers and students with useful feedback to better understand their classroom dynamics.
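To make the speaking-time measure concrete, the following is a minimal sketch of how per-student speaking time could be totaled from diarization output and compared with human annotations using a Pearson correlation. It is not the authors' implementation. The segment format `(speaker, start, end)`, the function names, and all data values are hypothetical, and the sketch assumes diarization labels have already been mapped to student identities.

```python
from collections import defaultdict
from math import sqrt


def speaking_time(segments):
    """Sum segment durations (in seconds) per speaker label.

    `segments` is an iterable of (speaker, start, end) tuples; this
    format is an assumption made for illustration only.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += max(0.0, end - start)
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical diarization output: (speaker, start_sec, end_sec).
predicted = [
    ("s1", 0.0, 4.0),
    ("s2", 4.0, 6.0),
    ("s1", 6.0, 9.0),
    ("s3", 9.0, 10.0),
]

# Hypothetical human-annotated speaking time per student, in seconds.
annotated = {"s1": 6.5, "s2": 2.5, "s3": 1.0}

estimated = speaking_time(predicted)
speakers = sorted(annotated)
r = pearson(
    [estimated.get(s, 0.0) for s in speakers],
    [annotated[s] for s in speakers],
)
print(estimated, round(r, 3))
```

Speakers missing from the diarization output are counted as zero seconds here, which penalizes a system that fails to detect a student at all; other conventions are possible.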
Keywords: automatic speech recognition, automatic classroom analysis, group collaboration, speaker diarization

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.