Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI‑Generated and Human Data
Abstract
Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly when examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer a promising solution: generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulation in Exercise Questionnaire (BREQ) and interviews with after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information has less impact on alignment accuracy than interview content. Item-level analysis reveals larger discrepancies for negatively worded questions, suggesting that LLMs struggle with emotional nuance. Person-level differences indicate varying model performance across respondents, highlighting the role of interview relevance over interview length. Although the LLMs replicate individual item trends, they fail to reconstruct the questionnaire's psychometric structure. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research. The R code and supplementary materials are available on the OSF platform (DOI: 10.17605/OSF.IO/AFQG3).
Keywords: quantitative data, qualitative data, LLM-driven interview, survey, behavioral regulations in exercise
Anthropic (2023). Claude 2 [Large language model]. https://www.anthropic.com/news/claude-2
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3), 337–351. https://doi.org/10.1017/pan.2023.2
Binz, M. & Schulz, E. (2023). Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120(6), e2218523120. https://doi.org/10.1073/pnas.2218523120
Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., & Larson, J. M. (2024). Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4), 401–416. https://doi.org/10.1017/pan.2024.5
Bishop, F. L. (2015). Using mixed methods research designs in health psychology: An illustrated discussion from a pragmatist perspective. British Journal of Health Psychology, 20(1), 5–20. https://doi.org/10.1111/bjhp.12122
Chang, S., Chaszczewicz, A., Wang, E., Josifovska, M., Pierson, E., & Leskovec, J. (2024). LLMs generate structurally realistic social networks but overestimate political homophily. arXiv. https://doi.org/10.48550/arXiv.2408.16629
Chen, J., Wang, X., Xu, R., Yuan, S., Zhang, Y., Shi, W., Xie, J., Li, S., Yang, R., Zhu, T., Chen, A., Li, N., Chen, L., Hu, C., Wu, S., Ren, S., Fu, Z., & Xiao, Y. (2024). From persona to personalization: A survey on role-playing language agents. arXiv. https://doi.org/10.48550/arXiv.2404.18231
Cid, L., Moutão, J., Leitão, J., & Alves, J. (2012). Behavioral Regulation Assessment in Exercise: Exploring an Autonomous and Controlled Motivation Index. The Spanish Journal of Psychology, 15(3), 1520–1528. https://doi.org/10.5209/rev_SJOP.2012.v15.n3.39436
Creswell, J. W. & Clark, V. L. P. (2017). Designing and conducting mixed methods research. SAGE Publications.
Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants? Trends in Cognitive Sciences, 27(7), 597–600. https://doi.org/10.1016/j.tics.2023.04.008
Ding, M., Deng, C., Choo, J., Wu, Z., Agrawal, A., Schwarzschild, A., Zhou, T., Goldstein, T., Langford, J., Anandkumar, A., & Huang, F. (2024). Easy2Hard-bench: Standardized difficulty labels for profiling LLM performance and generalization. Advances in Neural Information Processing Systems, 37, 44323–44365. https://proceedings.neurips.cc/paper_files/paper/2024/hash/4e6f22305275966513990f53cec908e0-Abstract-Datasets_and_Benchmarks_Track.html
Ekin, S. (2023). Prompt engineering for ChatGPT: A quick guide to techniques, tips, and best practices. arXiv. https://www.authorea.com/doi/full/10.36227/techrxiv.22683919?commit=95e67146c79e1ed93e4caa2eb930eb0984abec35
Fateen, M. & Mine, T. (2025). Developing a tutoring dialog dataset to optimize LLMs for educational use. arXiv. https://doi.org/10.48550/arXiv.2410.19231
Federiakin, D., Molerov, D., Zlatkin-Troitschanskaia, O., & Maur, A. (2024). Prompt engineering as a new 21st century skill. Frontiers in Education, 9. https://doi.org/10.3389/feduc.2024.1366434
Ge, T., Chan, X., Wang, X., Yu, D., Mi, H., & Yu, D. (2024). Scaling synthetic data creation with 1,000,000,000 personas. arXiv. https://doi.org/10.48550/arXiv.2406.20094
Huang, J., Jiao, W., Lam, M. H., Li, E. J., Wang, W., & Lyu, M. R. (2024). Revisiting the reliability of psychological scales on large language models. arXiv. https://doi.org/10.48550/arXiv.2305.19926
Huang, J., Wang, W., Li, E. J., Lam, M. H., Ren, S., Yuan, Y., Jiao, W., Tu, Z., & Lyu, M. R. (2024). Who is ChatGPT? Benchmarking LLMs’ psychological portrayal using PsychoBench. arXiv. https://doi.org/10.48550/arXiv.2310.01386
Jansen, B. J., Salminen, J., Jung, S., & Guan, K. (2022). Data-driven personas. Springer Nature.
Jiang, H., Zhang, X., Cao, X., Breazeal, C., Roy, D., & Kabbara, J. (2023). PersonaLLM: Investigating the ability of large language models to express personality traits. arXiv. https://doi.org/10.48550/arXiv.2305.02547
Johnson, R. B., Onwuegbuzie, A. J., & Turner, L. A. (2007). Toward a definition of mixed methods research. Journal of Mixed Methods Research, 1(2), 112–133. https://doi.org/10.1177/1558689806298224
Laverghetta Jr., A. & Licato, J. (2023). Generating better items for cognitive assessments using large language models. In E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, & T. Zesch (Eds.), Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) (pp. 414–428). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bea-1.34
Li, A., Chen, H., Namkoong, H., & Peng, T. (2025). LLM generated persona is a promise with a catch. arXiv. https://doi.org/10.48550/arXiv.2503.16527
Li, Y., Huang, Y., Wang, H., Zhang, X., Zou, J., & Sun, L. (2024). Quantifying AI psychology: A psychometrics benchmark for large language models. arXiv. https://doi.org/10.48550/arXiv.2406.17675
Liu, Y., Bhandari, S., & Pardos, Z. A. (2025). Leveraging LLM respondents for item evaluation: A psychometric analysis. British Journal of Educational Technology, 56(3), 1028–1052. https://doi.org/10.1111/bjet.13570
Liu, Y., Sharma, P., Oswal, M. J., Xia, H., & Huang, Y. (2024). PersonaFlow: Boosting research ideation with LLM-simulated expert personas. arXiv. https://doi.org/10.48550/arXiv.2409.12538
Lozić, E. & Štular, B. (2023). Fluent but not factual: A comparative analysis of ChatGPT and other AI chatbots’ proficiency and originality in scientific writing for humanities. Future Internet, 15(10), 336. https://doi.org/10.3390/fi15100336
Mancoridis, M., Weeks, B., Vafa, K., & Mullainathan, S. (2025, June 18). Potemkin Understanding in Large Language Models. Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=oetxkccLoq
May, T. A., Stone, G. E., Fan, Y., Sondergeld, C. J., LaPlante, J. N., Provinzano, K., Koskey, K. L. K., & Johnson, C. C. (2025). Using generative artificial intelligence tools to develop multiple-choice assessment items: An effectiveness study. American Educational Research Association.
Mendonça, P. C., Quintal, F., & Mendonça, F. (2025). Evaluating LLMs for automated scoring in formative assessments. Applied Sciences, 15(5), 2787. https://doi.org/10.3390/app15052787
Mullan, E., Markland, D., & Ingledew, D. (1997). A graded conceptualisation of self-determination in the regulation of exercise behaviour: Development of a measure using confirmatory factor analytic procedures. Personality and Individual Differences, 23(5), 745–752. https://doi.org/10.1016/S0191-8869(97)00107-4
Mullan, E. & Markland, D. (1997). Variations in self-determination across the stages of change for exercise in adults. Motivation and Emotion, 21, 349–362. https://doi.org/10.1023/A:1024436423492
Nori, H., King, N., McKinney, S. M., Carignan, D., & Horvitz, E. (2023). Capabilities of GPT-4 on medical challenge problems. arXiv. https://doi.org/10.48550/arXiv.2303.13375
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., … & Zoph, B. (2023). GPT-4 technical report. arXiv. https://doi.org/10.48550/arXiv.2303.08774
Parker, M. J., Anderson, C., Stone, C., & Oh, Y. (2024). A Large Language Model Approach to Educational Survey Feedback Analysis. International Journal of Artificial Intelligence in Education. https://doi.org/10.1007/s40593-024-00414-0
Peng, Q., Liu, H., Xu, H., Yang, Q., Shao, M., & Wang, W. (2024). Review-LLM: Harnessing large language models for personalized review generation. arXiv. https://doi.org/10.48550/arXiv.2407.07487
Ponce, O. A. & Pagán-Maldonado, N. (2015). Mixed methods research in education: Capturing the complexity of the profession. International Journal of Educational Excellence, 1(1), 111–135. https://doi.org/10.18562/ijee.2015.0005
Powell, H., Mihalas, S., Onwuegbuzie, A. J., Suldo, S., & Daley, C. E. (2008). Mixed methods research in school psychology: A mixed methods investigation of trends in the literature. Psychology in the Schools, 45(4), 291–309. https://doi.org/10.1002/pits.20296
Rasheed, Z., Waseem, M., Ahmad, A., Kemell, K.-K., Xiaofeng, W., Duc, A. N., & Abrahamsson, P. (2024). Can large language models serve as data analysts? A multi-agent assisted approach for qualitative data analysis. arXiv. https://doi.org/10.48550/arXiv.2402.01386
Saab, K., Tu, T., Weng, W.-H., Tanno, R., Stutz, D., Wulczyn, E., Zhang, F., Strother, T., Park, C., Vedadi, E., Chaves, J. Z., Hu, S.-Y., Schaekermann, M., Kamath, A., Cheng, Y., Barrett, D. G. T., Cheung, C., Mustafa, B., Palepu, A., … Natarajan, V. (2024). Capabilities of gemini models in medicine. arXiv. https://doi.org/10.48550/arXiv.2404.18416
Slavin, R., & Smith, D. (2009). The Relationship Between Sample Sizes and Effect Sizes in Systematic Reviews in Education. Educational Evaluation and Policy Analysis, 31(4), 500–506. https://doi.org/10.3102/0162373709352369
Sarstedt, M., Adler, S. J., Rau, L., & Schmitt, B. (2024). Using large language models to generate silicon samples in consumer and marketing research: Challenges, opportunities, and guidelines. Psychology & Marketing, 41(6), 1254–1270. https://doi.org/10.1002/mar.21982
Schoonenboom, J. (2023). The fundamental difference between qualitative and quantitative data in mixed methods research. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 24(1). https://doi.org/10.17169/fqs-24.1.3986
Serapio-García, G., Safdari, M., Crepy, C., Sun, L., Fitz, S., Romero, P., Abdulhai, M., Faust, A., & Matarić, M. (2023). Personality traits in large language models. arXiv. https://doi.org/10.48550/arXiv.2307.00184
Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv. https://doi.org/10.48550/arXiv.2506.06941
Sun, S., Lee, E., Nan, D., Zhao, X., Lee, W., Jansen, B. J., & Kim, J. H. (2024). Random silicon sampling: Simulating human sub-population opinion using a large language model based on group-level demographic information. arXiv. https://doi.org/10.48550/arXiv.2402.18144
Uto, M. & Uchida, Y. (2020). Automated short-answer grading using deep neural networks and item response theory. In I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, & E. Millán (Eds.), Artificial intelligence in education (pp. 334–339). Springer International Publishing. https://doi.org/10.1007/978-3-030-52240-7_61
Wang, J., Hida, R. M., Park, J., Kim, E. K., & Begeny, J. C. (2024). A systematic review of mixed methods studies published in six school psychology journals: Prevalence, characteristics, and trends from 2011 to 2020. Psychology in the Schools, 61(4), 1302–1317. https://doi.org/10.1002/pits.23114
Wang, P., Zou, H., Yan, Z., Guo, F., Sun, T., Xiao, Z., & Zhang, B. (2024). Not yet: Large language models cannot replace human respondents for psychometric research. OSF Preprint. https://doi.org/10.31219/osf.io/rwy9b
Wang, Q. & Li, H. (2025). On continually tracing origins of LLM-generated text and its application in detecting cheating in student coursework. Big Data and Cognitive Computing, 9(3), 50. https://doi.org/10.3390/bdcc9030050
Wilson, P., Rodgers, W., & Fraser, S. (2002). Examining the psychometric properties of the Behavioral Regulation in Exercise Questionnaire. Measurement in Physical Education and Exercise Science, 6(1), 1–21. https://doi.org/10.1207/S15327841MPEE0601_1
Wu, S., Koo, M., Blum, L., Black, A., Kao, L., Scalzo, F., & Kurtz, I. (2023). A comparative study of open-source large language models, GPT-4 and claude 2: Multiple-choice test taking in nephrology. arXiv. https://doi.org/10.48550/arXiv.2308.04709
Xu, S. & Zhang, X. (2023). Leveraging generative artificial intelligence to simulate student learning behavior. arXiv. https://doi.org/10.48550/arXiv.2310.19206
Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2024). Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55(1), 90–112. https://doi.org/10.1111/bjet.13370
Yu, C., Ye, J., Li, Y., Li, Z., Ferrara, E., Hu, X., & Zhao, Y. (2024). A large-scale simulation on large language models for decision-making in political science. arXiv. https://doi.org/10.48550/arXiv.2412.15291

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term “Work” shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attribution—other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercial—other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Works—other users (including Publisher) may not alter, transform, or build upon this Work, with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
- the Work is the Author’s original work;
- the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
- the Work is not pending review or under consideration by another publisher;
- the Work has not previously been published;
- the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
- the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.