Mapping Python Programs to Vectors using Recursive Neural Encodings

##plugins.themes.bootstrap3.article.main##

##plugins.themes.bootstrap3.article.sidebar##

Published Oct 31, 2021
Benjamin Paassen Jessica McBroom Bryn Jeffries Irena Koprinska Kalina Yacef

Abstract

Educational data mining involves the application of data mining techniques to student activity. However, in the context of computer programming, many data mining techniques can not be applied because they require vector-shaped input, whereas computer programs have the form of syntax trees. In this paper, we present ast2vec, a neural network that maps Python syntax trees to vectors and back, thereby enabling about a hundred data mining techniques that were previously not applicable. Ast2vec has been trained on almost half a million programs of novice programmers and is designed to be applied across learning tasks without re-training, meaning that users can apply it without any need for deep learning. We demonstrate the generality of ast2vec in three settings. First, we provide example analyses using ast2vec on a classroom-sized dataset, involving two novel techniques, namely progress-variance projection for visualization and a dynamical systems analysis for prediction. In these examples, we also explain how ast2vec can be utilized for educational decisions. Second, we consider the ability of ast2vec to recover the original syntax tree from its vector representation on the training data and two other large-scale programming datasets. Finally, we evaluate the predictive capability of a linear dynamical system on top of ast2vec, obtaining similar results to techniques that work directly on syntax trees while being much faster (constant- instead of linear-time processing). We hope ast2vec can augment the educational data mining toolkit by making analyses of computer programs easier, richer, and more efficient.

How to Cite

Paassen, B., McBroom, J., Jeffries, B., Koprinska, I., & Yacef, K. (2021). Mapping Python Programs to Vectors using Recursive Neural Encodings. Journal of Educational Data Mining, 13(3), 1–35. https://doi.org/10.5281/zenodo.5634224
Abstract 776 | PDF Downloads 447

##plugins.themes.bootstrap3.article.details##

Keywords

computer science education, computer programs, word embeddings, representation learning, neural networks, visualization, program vectors

References
AGGARWAL, C. C., HINNEBURG, A., AND KEIM, D. A. 2001. On the surprising behavior of distance metrics in high dimensional space. In Proceedings of the International Conference on Database Theory (ICDT 2001), J. Van den Bussche and V. Vianu, Eds. Springer, Berlin, Heidelberg, 420–434.

AHO, A., LAM, M., SETHI, R., AND ULLMAN, J. 2006. Compilers: Principles, Techniques, and Tools, 2 ed. Addison Wesley, Boston, MA, USA.

ALON, U., ZILBERSTEIN, M., LEVY, O., AND YAHAV, E. 2019. Code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, 40.

BAKER, R. S. AND YACEF, K. 2009. The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining 1, 1, 3–17.

BARNES, T., MOSTAFAVI, B., AND EAGLE, M. J. 2016. Data-driven domain models for problem solving. In Design Recommendations for Intelligent Tutoring Systems, R. Sottilare, A. Graesser, X. Hu, A. Olney, B. Nye, and A. Sinatra, Eds. Vol. 4. USArmy Research Laboratory, Orlando, FL, USA, 137–145.

BRODER, A. Z., GLASSMAN, S. C., MANASSE, M. S., AND ZWEIG, G. 1997. Syntactic clustering of the web. Computer Networks and ISDNSystems 29, 8, 1157 – 1166.

CAMPI, M. AND KUMAR, P. 1998. Learning dynamical systems in a stationary environment. Systems & Control Letters 34, 3, 125 – 132.

CHEN, X., LIU, C., AND SONG, D. 2018. Tree-to-tree neural networks for program translation. In Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates Inc., 2552–2562.

CHO, K., VAN MERRIENBOER, B., GÜLÇEHRE, Ç., BOUGARES, F., SCHWENK, H., AND BENGIO, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, A. Moschitti, B. Pang, and W. Daelemans, Eds. 1724–1734.

CHOUDHURY, R. R., YIN, H., MOGHADAM, J., AND FOX, A. 2016. Autostyle: Toward coding style feedback at scale. In Proceedings of the 19th ACMConference on Computer Supported Cooperative Work and Social Computing Companion, D. Gergle, M. R. Morris, P. Bjørn, J. Konstan, G. Hsieh, and N. Yamashita, Eds. 21–24.

DAI, H., TIAN, Y., D AI, B., SKIENA, S., AND SONG, L. 2018. Syntax-directed variational autoencoder for structured data. In Proceedings of the 6th International Conference on Learning Representations, Y. Bengio, Y. LeCun, T. Sainath, I. Murray, M. A. Ranzato, and O. Vinyals, Eds.

DEMPSTER, A. P., LAIRD, N. M., AND RUBIN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39, 1, 1–38.

DENNING, P. J. 2017. Remaining trouble spots with computational thinking. Communications of the ACM 60, 6, 33–39.

GISBRECHT, A. AND SCHLEIF, F.-M. 2015. Metric and non-metric proximity transformations at linear costs. Neurocomputing 167, 643–657.

GLASSMAN, E. L., SCOTT, J., SINGH, R., GUO, P. J., AND MILLER, R. C. 2015. Overcode: Visualizing variation in student solutions to programming problems at scale. ACM Transactions on Computer-Human Interaction 22, 2, 7.

GROSS, S., MOKBEL, B., PAASSEN, B., HAMMER, B., AND PINKWART, N. 2014. Example-based feedback provision using structured solution spaces. International Journal of Learning Technology 9, 3, 248–280.

GROSS, S. AND PINKWART, N. 2015. How do learners behave in help-seeking when given a choice? In Proceedings of the 17th International Conference on Artificial Intelligence in Education, C. Conati, N. Heffernan, A. Mitrovic, and M. F. Verdejo, Eds. 600–603.

GULWANI, S., RADIČEK, I., AND ZULEGER, F. 2018. Automated clustering and program repair for introductory programming assignments. ACM SIGPLAN Notices 53, 4, 465–480.

HAMMER, B. AND HASENFUSS, A. 2010. Topographic mapping of large dissimilarity data sets. Neural Computation 22, 9, 2229–2284.

HAZAN, E., SINGH, K., AND ZHANG, C. 2017. Learning linear dynamical systems via spectral filtering. In Proceedings of the 30th Conference on Advances Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. 6702–6712.

HYNDMAN, R. J. AND KOEHLER, A. B. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4, 679 – 688.

IHANTOLA, P., AHONIEMI, T., KARAVIRTA, V., AND SEPPÄLÄ, O. 2010. Review of recent systems for automatic assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research, C. Schulte and J. Suhonen, Eds. 86–93.

KINGMA, D. AND BA, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, Y. Bengio, Y. LeCun, B. Kingsbury, S. Bengio, N. de Freitas, and H. Larochelle, Eds.

KINGMA, D. AND WELLING, M. 2013. Auto-encoding variational Bayes. In Proceedings of the 1st International Conference on Learning Representations, Y. Bengio, Y. LeCun, A. Courville, R. Fergus, and C. Manning, Eds.

KIPF, T. N. AND WELLING, M. 2016. Semi-supervised classification with graph convolutional networks. In Proceedings of the 4th International Conference on Learning Representations, H. Larochelle, S. Bengio, B. Kingsbury, Y. Bengio, and Y. LeCun, Eds.

KOPRINSKA, I., STRETTON, J., AND YACEF, K. 2015. Predicting student performance from multiple data sources. In Proceedings of the International Conference on Artificial Intelligence in Education, C. Conati, N. Heffernan, A. Mitrovic, and M. Verdejo, Eds. 678–681.

KUSNER, M. J., PAIGE, B., AND HERNÁNDEZ-LOBATO, J. M. 2017. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, Eds. 1945–1954.

LAHTINEN, E., ALA-MUTKA, K., AND JÄRVINEN, H.-M. 2005. A study of the difficulties of novice programmers. SIGCSE Bulletin 37, 3, 14–18.

LAN, A. S., STUDER, C., AND BARANIUK, R. G. 2014. Time-varying learning and content analytics via sparse factor analysis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, S. Macskassy, C. Perlich, J. Leskovec, W. Wang, and R. Ghani, Eds. 452–461.

MCBROOM, J., KOPRINSKA, I., AND YACEF, K. 2020. How does student behaviour change approaching dropout? A study of gender and school year differences. In Proceedings of the 13th International Conference on Educational Data Mining, A. Rafferty, J. Whitehill, V. Cavalli-Sforza, and C. Romero, Eds. 643–647.

MCBROOM, J., KOPRINSKA, I., AND YACEF, K. 2021. A survey of automated programming hint generation: The hints framework. ACM Computing Surveys 54, 8, 172.

MCBROOM, J., PAASSEN, B., JEFFRIES, B., KOPRINSKA, I., AND YACEF, K. 2021. Progress networks as a tool for analysing student programming difficulties. In Proceedings of the Twenty-Third Australasian Conference on Computing Education, C. Szabo and J. Sheard, Eds. 158–167.

MCBROOM, J., YACEF, K., KOPRINSKA, I., AND CURRAN, J. R. 2018. A data-driven method for helping teachers improve feedback in computer programming automated tutors. In Artificial Intelligence in Education, C. Penstein Rosé, R. Martı́nez-Maldonado, H. U. Hoppe, R. Luckin, M. Mavrikis, K. Porayska-Pomsta, B. McLaren, and B. du Boulay, Eds. 324–337.

MCCRACKEN, M., ALMSTRUM, V., D IAZ, D., GUZDIAL, M., HAGAN, D., KOLIKANT, Y. B.-D., LAXER, C., THOMAS, L., UTTING, I., AND WILUSZ, T. 2001. A multi-national, multi-institutional study of assessment of programming skills of first-year CS students. In Working Group Reports from ITiCSE on Innovation and Technology in Computer Science Education. 125–180.

MIKOLOV, T., SUTSKEVER, I., CHEN, K., CORRADO, G. S., AND DEAN, J. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. 3111–3119.

MOKBEL, B., GROSS, S., PAASSEN, B., PINKWART, N., AND HAMMER, B. 2013. Domain-independent proximity measures in intelligent tutoring systems. In Proceedings of the 6th International Conference on Educational Data Mining, S. K. D'Mello, R. A. Calvo, and A. Olney, Eds. 334–335.

NGUYEN, A., PIECH, C., HUANG, J., AND GUIBAS, L. 2014. Codewebs: Scalable homework search for massive open online programming courses. In Proceedings of the 23rd International Conference on World Wide Web, C.-W. Chung, A. Broder, K. Shim, and T. Suel, Eds. Association for Computing Machinery, 491–502.

PAASSEN, B., GÖPFERT, C., AND HAMMER, B. 2018. Time series prediction for graphs in kernel and dissimilarity spaces. Neural Processing Letters 48, 2, 669–689.

PAASSEN, B., HAMMER, B., PRICE, T., BARNES, T., GROSS, S., AND PINKWART, N. 2018. The continuous hint factory providing hints in vast and sparsely populated edit distance spaces. Journal of Educational Data Mining 10, 1, 1–35.

PAASSEN, B., JENSEN, J., AND HAMMER, B. 2016. Execution traces as a powerful data representation for intelligent tutoring systems for programming. In Proceedings of the 9th International Conference on Educational Data Mining, T. Barnes, M. Chi, and M. Feng, Eds. International Educational Data Mining Society, 183 – 190.

PAASSEN, B., KOPRINSKA, I., AND YACEF, K. 2021. Recursive tree grammar autoencoders. arXiv 2012.02097. https://arxiv.org/abs/2012.02097.

PASZKE, A., GROSS, S., MASSA, F., LERER, A., BRADBURY, J., CHANAN, G., KILLEEN, T., LIN, Z., GIMELSHEIN, N., ANTIGA, L., DESMAISON, A., KOPF, A., YANG, E., D EVITO, Z., RAISON, M., TEJANI, A., CHILAMKURTHY, S., STEINER, B., FANG, L., BAI, J., AND CHINTALA, S. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 32nd International Conference on Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché Buc, E. Fox, and R. Garnett, Eds. 8026–8037.

PEARSON, K. 1901. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11, 559–572.

PEDDYCORD, III, B., HICKS, A., AND BARNES, T. 2014. Generating hints for programming problems using intermediate output. In Proceedings of the 7th International Conference on Educational Data Mining, M. Mavrikis and B. M. McLaren, Eds. 92–98.

PEKALSKA, E. AND DUIN, R. 2005. The Dissimilarity Representation for Pattern Recognition: Foundations And Applications (Machine Perception and Artificial Intelligence). World Scientific Publishing Co., Inc., River Edge, NJ, USA.

PENNINGTON, J., SOCHER, R., AND MANNING, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, A. Moschitti, B. Pang, and W. Daelemans, Eds. 1532–1543.

PIECH, C., HUANG, J., NGUYEN, A., PHULSUKSOMBATI, M., SAHAMI, M., AND GUIBAS, L. 2015. Learning program embeddings to propagate feedback on student code. In Proceedings of the 37th International Conference on Machine Learning, F. Bach and D. Blei, Eds. 1093–1102.

PIECH, C., SAHAMI, M., HUANG, J., AND GUIBAS, L. 2015. Autonomously generating hints by inferring problem solving policies. In Proceedings of the Second ACM Conference on Learning @ Scale, G. Kiczales, D. Russell, and B. Woolf, Eds. 195–204.

POLITI, A. 2013. Lyapunov exponent. Scholarpedia 8, 3, 2722.

PRICE, T., ZHI, R., AND BARNES, T. 2017. Evaluation of a data-driven feedback algorithm for open-ended programming. In Proceedings of the 10th International Conference on Educational Data Mining, X. Hu, T. Barnes, and P. Inventado, Eds. 192–197.

PRICE, T. W., DONG, Y., ZHI, R., PAASSEN, B., LYTLE, N., CATETÉ, V., AND BARNES, T. 2019. A comparison of the quality of data-driven programming hint generation algorithms. International Journal of Artificial Intelligence in Education 29, 3, 368–395.

QIAN, Y. AND LEHMAN, J. 2017. Students' misconceptions and other difficulties in introductory programming: A literature review. ACM Transactions on Computing Education 18, 1, 1–25.

REDDY, S., LABUTOV, I., AND JOACHIMS, T. 2016. Latent skill embedding for personalized lesson sequence recommendation. arXiv 1602.07029.

RIVERS, K. AND KOEDINGER, K. R. 2012. A canonicalizing model for building programming tutors. In Proceedings of the 11th International Conference on Intelligent Tutoring Systems, (ITS 2012), S. A. Cerri, W. J. Clancey, G. Papadourakis, and K. Panourgia, Eds. 591–593.

RIVERS, K. AND KOEDINGER, K. R. 2017. Data-driven hint generation in vast solution spaces: a self-improving python programming tutor. International Journal of Artificial Intelligence in Education 27, 1, 37–64.

ROBINS, A., ROUNTREE, J., AND ROUNTREE, N. 2003. Learning and teaching programming: A review and discussion. Computer Science Education 13, 2, 137–172.

SCARSELLI, F., GORI, M., TSOI, A. C., HAGENBUCHNER, M., AND MONFARDINI, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20, 1, 61–80.

TAI, K. S., SOCHER, R., AND MANNING, C. D. 2015. Improved semantic representations from tree- structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistic, Y. Matsumoto, C. Zong, and M. Strube, Eds. 1556–1566.

TRUONG, N., ROE, P., AND BANCROFT, P. 2004. Static analysis of students' Java programs. In Proceedings of the Sixth Australasian Conference on Computing Education, R. Lister and A. Young, Eds. 317–325.

WILES, R., DURRANT, G., BROE, S. D., AND POWELL, J. 2009. Methodological approaches at PhD and skills sought for research posts in academia: A mismatch? International Journal of Social Research Methodology 12, 3, 257–269.

ZHANG, K. AND SHASHA, D. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing 18, 6, 1245–1262.

ZHI, R., PRICE, T. W., LYTLE, N., DONG, Y., AND BARNES, T. 2018. Reducing the state space of programming problems through data-driven feature detection. In Proceedings of the second Educational Data Mining in Computer Science Education Workshop, D. Azcona, S. Hsiao, N.-T. Le, J. Stamper, and M. Yudelson, Eds.

ZIMMERMAN, K. AND RUPAKHETI, C. R. 2015. An automated framework for recommending program elements to novices (N). In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015), M. Cohen, L. Grunske, and M. Whalen, Eds. 283–288.
Section
EDM 2021 Journal Track

Most read articles by the same author(s)

<< < 1 2