Designing Safe and Relevant Generative Chats for Math Learning in Intelligent Tutoring Systems
Abstract
Large language models (LLMs) are flexible, personalizable, and available, which makes their use within Intelligent Tutoring Systems (ITSs) appealing. However, their flexibility creates risks: inaccuracies, harmful content, and non-curricular material. Ethically deploying LLM-backed ITSs requires designing safeguards that ensure positive experiences for students. We describe the design of a conversational system integrated into an ITS that uses safety guardrails and retrieval-augmented generation to support middle-grade math learning. We evaluated this system using red-teaming, offline analyses, an in-classroom usability test, and a field deployment. We present empirical data from more than 8,000 student conversations designed to encourage a growth mindset, finding that the GPT-3.5 LLM rarely generates inappropriate messages and that retrieval-augmented generation improves response quality. The student interaction behaviors we observe have implications for designers, who should treat student inputs as a content moderation problem, and for researchers, who should focus on subtle forms of bad content and on creating metrics and evaluation processes. Code and data are available at https://www.github.com/DigitalHarborFoundation/chatbot-safety and https://www.github.com/DigitalHarborFoundation/rag-for-math-qa.
Keywords: large language models, intelligent tutoring systems, safety, system design

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.