Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI‑Generated and Human Data

Published January 14, 2026
Jihong Zhang, Xinya Liang, Anqi Deng, Nicole Bonge, Lin Tan, Ling Zhang, Nicole Zarrett

Abstract

Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly when examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer a promising solution: generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews with after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information has less impact on alignment accuracy than interview content. Item-level analysis reveals larger discrepancies for negatively worded items, suggesting that LLMs struggle with emotional nuance. Person-level differences indicate that model performance varies across respondents, highlighting the role of interview relevance over interview length. Although the LLMs replicate individual item trends, they falter in reconstructing the instrument's psychometric structure. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research. The R code and supplementary materials are available on the OSF platform (DOI: 10.17605/OSF.IO/AFQG3).
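To make the pipeline concrete, the sketch below illustrates the kind of interview-informed simulation the abstract describes: one low-temperature LLM call per respondent, prompted with that respondent's interview transcript and demographics, returning Likert answers whose spread can then be compared with the human answers. This is a minimal illustrative Python sketch, not the authors' released R code (available on OSF); the openai client, the model name, the prompt wording, and the sample items are all assumptions.

# Minimal illustrative sketch (Python), not the authors' released R code.
# Assumed: the `openai` client library, an OPENAI_API_KEY in the environment,
# and placeholder model name, prompt wording, and survey items.
import re
import statistics

from openai import OpenAI

client = OpenAI()

ITEMS = [  # placeholder items written in the style of BREQ statements
    "I exercise because other people say I should.",  # negatively worded
    "I value the benefits of physical activity.",
    "I feel guilty when I do not exercise.",
]

def simulate_respondent(interview: str, demographics: str) -> list[int]:
    """One low-temperature LLM call that answers every item in persona."""
    prompt = (
        "Role-play the survey respondent described below, answering only "
        "from what the interview supports.\n\n"
        f"Demographics: {demographics}\n\n"
        f"Interview transcript:\n{interview}\n\n"
        "For each item, reply with one integer per line, from 0 (not true "
        "for me) to 4 (very true for me), and nothing else:\n"
        + "\n".join(f"- {item}" for item in ITEMS)
    )
    reply = client.chat.completions.create(
        model="gpt-4o",    # illustrative; the study compared several models
        temperature=0.2,   # low temperature improved human-LLM alignment
        messages=[{"role": "user", "content": prompt}],
    )
    text = reply.choices[0].message.content
    # Simple (and deliberately strict) parsing: one 0-4 digit per line.
    return [int(d) for d in re.findall(r"^\s*([0-4])\s*$", text, flags=re.M)]

if __name__ == "__main__":
    # Compare per-respondent spread: the study found LLM answers vary less
    # than human answers. File name and human scores are placeholders.
    llm = simulate_respondent(open("interview_01.txt").read(), "program staff, age 24")
    human = [0, 4, 3]
    print("LLM SD:", statistics.stdev(llm), "| human SD:", statistics.stdev(human))

In a design like this, the temperature setting and the inclusion or exclusion of the interview text and demographics become experimental conditions, mirroring the comparisons summarized in the abstract.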

How to Cite

Zhang, J., Liang, X., Deng, A., Bonge, N., Tan, L., Zhang, L., & Zarrett, N. (2026). Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data. Journal of Educational Data Mining, 18(1), 1–24. https://doi.org/10.5281/zenodo.18733538

Details

Keywords

quantitative data, qualitative data, LLM-driven interview, survey, behavioral regulations in exercise

Section
Special Section: Human-AI Partnership for Qualitative Analysis