Seeing Is Solving: MLLMs, Reasoning, and Refusal in Visual Math
Abstract
Many middle-school math problems are image-dependent: the diagram or graph carries essential information. This matters for intelligent tutoring and accessibility, where systems must reason over figures and also decline responsibly when figures are missing. We evaluate six contemporary multimodal large language models (MLLMs)—three reasoning models and three non-reasoning models—on 376 Illustrative Mathematics (IM) items labeled as image-role Required (the figure contains task-critical information not recoverable from text alone without added assumptions). Each model attempts every item three times, both with and without the figure, under a shared prompt and scoring protocol. To reduce subjectivity in image-role labeling, we classify items as not Required when they are solvable from text alone without additional assumptions. With images, the top-performing reasoning models achieve accuracy in the mid-50s, while non-reasoning models fall in the mid-30s to low-40s. Without images, models overwhelmingly refuse rather than guess, with only rare correct-by-chance answers. Models show moderate agreement on which items are solvable, and we release two benchmark subsets of items solved consistently across models. A qualitative audit of 83 items shows that visual misreading is the dominant failure mode for non-reasoning models, while reasoning models more often produce correct answers accompanied by adequate explanations. These results suggest tutoring systems should gate automated scoring and learner-model updates on visual-evidence availability and use scaffolds that require explicit visual-evidence binding before algebra. For accessibility, systems should treat no-image refusals as missing-context signals and elicit the figure or a structured description, enabling description-substitution experiments. We release code, prompts, and summary artifacts for replication. Code and data: https://osf.io/ct7bg/
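The evaluation protocol described above (three attempts per item, with and without the figure, each attempt scored as correct, incorrect, or refusal) can be illustrated with a minimal sketch. This is not the authors' released code; the function name, outcome labels, and aggregation choices (per-condition accuracy and refusal rate, plus an all-attempts-correct consistency flag) are illustrative assumptions.

```python
from collections import Counter

def summarize(attempts):
    """Aggregate one (model, item, condition) cell of the protocol.

    attempts: list of outcome strings, one per attempt, each
    "correct", "incorrect", or "refusal". The three-outcome coding
    and these aggregate names are assumptions for illustration.
    """
    counts = Counter(attempts)
    n = len(attempts)
    return {
        "accuracy": counts["correct"] / n,
        "refusal_rate": counts["refusal"] / n,
        # An item might enter a "solved consistently" benchmark subset
        # only if every attempt is correct.
        "solved_consistently": counts["correct"] == n,
    }

# Example: one Required item under the two conditions.
with_image = ["correct", "correct", "incorrect"]
no_image = ["refusal", "refusal", "refusal"]

print(summarize(with_image))  # partial accuracy, no refusals
print(summarize(no_image))    # zero accuracy, all refusals
```

Under this sketch, the abstract's no-image finding corresponds to cells with refusal rates near 1.0 rather than low accuracy from wrong guesses.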
Keywords: multimodal large language models, image-dependent math, refusal, reasoning, educational data mining, accessibility

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term “Work” shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attribution—other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercial—other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Works—other users (including Publisher) may not alter, transform, or build upon this Work, with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
  - the Work is the Author’s original work;
  - the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
  - the Work is not pending review or under consideration by another publisher;
  - the Work has not previously been published;
  - the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
  - the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.