Educational data mining involves the application of data mining techniques to student activity. In the context of computer programming, however, many data mining techniques cannot be applied because they require vector-shaped input, whereas computer programs have the form of syntax trees. In this paper, we present ast2vec, a neural network that maps Python syntax trees to vectors and back, thereby enabling about a hundred data mining techniques that were previously not applicable. Ast2vec has been trained on almost half a million programs written by novice programmers and is designed to be applied across learning tasks without re-training, meaning that users can apply it without any need for deep learning. We demonstrate the generality of ast2vec in three settings. First, we provide example analyses using ast2vec on a classroom-sized dataset, involving two novel techniques, namely progress-variance projection for visualization and a dynamical systems analysis for prediction; in these examples, we also explain how ast2vec can inform educational decisions. Second, we evaluate the ability of ast2vec to recover the original syntax tree from its vector representation on the training data and on two other large-scale programming datasets. Finally, we evaluate the predictive capability of a linear dynamical system on top of ast2vec, obtaining results similar to techniques that work directly on syntax trees while being much faster (constant- instead of linear-time processing). We hope ast2vec can augment the educational data mining toolkit by making analyses of computer programs easier, richer, and more efficient.
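The two core ideas in the abstract, programs as syntax trees and a linear dynamical system fit on their vector representations, can be illustrated with a minimal sketch. This is not the paper's implementation: the standard-library ast module stands in for ast2vec's tree handling, and randomly generated vectors stand in for real ast2vec encodings of a student's program trace.

```python
import ast
import numpy as np

# Python programs are syntax trees, not vectors. The standard-library
# ast module shows the tree shape that an encoder like ast2vec starts from:
tree = ast.parse("x = 1\nprint(x)")
print(ast.dump(tree))  # nested Module/Assign/Expr nodes, not a flat vector

rng = np.random.default_rng(0)

# Stand-in for ast2vec encodings: a trace of 6 intermediate program
# states, each represented as an 8-dimensional vector.
trace = rng.normal(size=(6, 8))

# Fit a linear dynamical system x_{t+1} ≈ x_t @ W by least squares
# over consecutive pairs of states in the trace.
X_prev, X_next = trace[:-1], trace[1:]
W, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)

# Predicting the next program state is then a single matrix-vector
# product, regardless of how many traces were used to fit W.
prediction = trace[-1] @ W
print(prediction.shape)  # (8,)
```

Because prediction reduces to one matrix-vector product, its per-step cost does not grow with the size of the training set, which is the intuition behind the constant- versus linear-time comparison in the abstract.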
Keywords: computer science education, computer programs, word embeddings, representation learning, neural networks, visualization, program vectors
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.