In this paper, we present a method for recognizing algorithms from students' programming submissions coded in Java. The method is based on the concept of programming schemas and beacons. Schemas are high-level programming knowledge with detailed knowledge abstracted out, and beacons are statements that imply specific structures in a program. The method automatically searches for schemas from the given program and compares the extracted schemas with those from the knowledge base to recognize the algorithm-specific code for further processing. In the next step, several characteristics and beacons specific to the given algorithm are computed from the code. These characteristics and beacons are then used as the learning data and given to the C4.5 algorithm, which builds a classification tree that can be used to classify previously unseen implementations of algorithms. The method and its performance is demonstrated in the case of basic sorting algorithms and their variations implemented both in various learning resources, that is, textbooks and websites, (N = 209 programs) and in genuine student submissions in a first year data structures and algorithms course (N = 159). The empirical study conducted for evaluating the performance of the classification by leave-one-out cross-validation technique shows that the estimated true positive rate is 97.0%. The results demonstrate the feasibility of the idea of recognizing algorithms based on algorithm-specific programming schemas and beacons. The method can be used as white-box analysis to verify that students have implemented the required algorithm and to give feedback to students on their problematic implementations. We discuss the applications of the method from both teachers' and students' point of view.
How to Cite
algorithm recognition, detecting programming schemas, roles of variables, white-box analysis, decision tree classifiers
AMERSHI, S. AND CONATI, C. 2009. Combining unsupervised and supervised classification to build user models for exploratory learning environments. Journal of Educational Data Mining 1, 1, 18-71.
BAKER, B. S. 1995. On finding duplication and near-duplication in large software systems. In Proceedings of the Second Working Conference on Reverse Engineering (WCRE '95), 1995. IEEE Computer Society Washington, DC, USA, 86-95.
BASIT, H. A., PUGLISI, S. J., MCMASTER, W. F. S., TURPIN, A., AND JARZABEK, S. 2007. Efficient token based clone detection with flexible tokenization. In Proceedings of the 6th European Software Engineering Conference and Foundations of Software Engineering, ESEC/FSE 2007. ACM New York, NY, USA, 513-516.
BAXTER, I. D., YAHIN, A., MOURA, L., SANT'ANNA, M., AND BIER, L. 1998. Clone detection using abstract syntax trees. In Proceedings of the 14th IEEE International Conference on Software Maintenance, Bethesda, Maryland, USA, 16-19 March, 1998. IEEE Computer Society Washington, DC, USA, 368-377.
BELLON, S., KOSCHKE, R., ANTONIOL, G., KRINKE, J., AND MERLO, E. 2007. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering 33, 9, 577-591.
BEN-ARI, M. AND SAJANIEMI, J. 2004. Roles of variables as seen by CS educators. SIGCSE Bulletin 36, 3, 52-56.
BISHOP, C. AND JOHNSON, C. G. 2005. Assessing roles of variables by program analysis. In Proceedings of the 5th Baltic Sea Conference on Computing Education Research, Koli, Finland, 17-20 November. University of Joensuu, Finland, 131-136.
BROOKS, R. 1983. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies 18, 6, 543-554.
BYCKLING, P. AND SAJANIEMI, J. 2006. Roles of variables and programming skills improvement. SIGCSE Bulletin 38, 1, 413-417.
DÉTIENNE, F. 1990. Expert programming knowledge: A schema-based approach. In Psychology of Programming, J.-M. Hoc, T. R. G. Green, R. Samurcay, and D. J. Gilmore, Eds. Academic Press, London, 205-222.
DIT, B., REVELLE, M., GETHERS, M., AND POSHYVANYK, D. 2013. Feature location in source code: A taxonomy and survey. Journal of Software: Evolution and Process (JSEP) 25, 1, 53-95.
DOMINGUEZ, A. K., YACEF, K., AND CURRAN, J. 2010. Data mining to generate individualised feedback. In Proceedings of the 10th international conference on Intelligent Tutoring Systems (ITS '10)- Volume Part II. Springer-Verlag Berlin, Heidelberg, 303-305.
DRUMMOND, J. AND LITMAN, D. J. 2010. In the zone: Towards detecting student zoning out using supervised machine learning. In Proceedings of the 10th international conference on Intelligent Tutoring Systems (ITS '10)- Volume Part II. Springer-Verlag Berlin, Heidelberg, 306-308.
DUCASSE, S., RIEGER, M., AND DEMEYER, S. 1999. A language independent approach for detecting duplicated code. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM '99), 1999. IEEE Computer Society Washington, DC, USA, 109-118.
EDWARDS, D., SIMMONS, S., AND WILDE, N. 2006. An approach to feature location in distributed systems. Journal of Systems and Software 79, 1, 57-68.
EDWARDS, S. H. 2003. Rethinking computer science education from a test-first perspective. In Companion of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, Anaheim, California, USA, 26-30 October. ACM, New York, NY, USA, 148-155.
EISENBERG, A. D. AND VOLDER, K. D. 2005. Dynamic feature traces: Finding features in unfamiliar code. In Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM '05), 2005. IEEE Computer Society Washington, DC, USA, 337-346.
GAY, G., HAIDUC, S., MARCUS, A., AND MENZIES, T. 2009. On the use of relevance feedback in ir-based concept location. In Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM '09), Edmonton, Canada, 20-26 September, 2009. Computer Society Washington, DC, USA, 351-360.
GERDT, P. AND SAJANIEMI, J. 2004. An approach to automatic detection of variable roles in program animation. In Proceedings of the 3th Program Visualization Workshop, the University of Warwick, UK, 1-2 July. The University of Warwick, UK, 86-93.
GERDT, P. AND SAJANIEMI, J. 2006. A web-based service for the automatic detection of roles of variables. In Proceedings of the 11th annual SIGCSE conference on Innovation and technology in computer science education, Bologna, Italy, 26-28 June. ACM, New York, NY, USA, 178-182.
HALSTEAD, M. 1977. Elements of Software Science. Elsevier Science Inc, New York, NY, USA. HARANDI, M. AND NING, J. 1990. Knowledge-based program analysis. Software IEEE 7, 4, 74-81.
HIGGINS, C., SYMEONIDIS, P., AND TSINTSIFAS, A. 2002. The marking system for CourseMaster. In Proceedings of the 7th annual conference on Innovation and Technology in Computer Science Education, Aarhus, Denmark, 24-26 June. ACM, New York, NY, USA, 46-50.
IHANTOLA, P., KARAVIRTA, V., SEPPÄLÄ, O., AND AHONIEMI, T. 2010. Review of recent systems for automatic assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research (Koli Calling 2010).
JOHNSON, W. L. AND SOLOWAY, E. 1984. Proust: Knowledge-based program understanding. In Proceedings of the 7th international conference on Software engineering, Orlando, Florida, USA, 26-29 March. IEEE Press Piscataway, NJ, USA, 369-380.
JOY, M., GRIFFITHS, N., AND BOYATT, R. 2005. The BOSS online submission and assessment system. ACM Journal on Educational Resources in Computing 5, 3, 1-28.
KOMONDOOR, R. AND HORWITZ, S. 2001. Using slicing to identify duplication in source code. In Proceedings of the 8th International Symposium on Static Analysis, SAS 2001, Paris, France, 16-18 July, 2001. Springer, 40-56.
KONTOGIANNIS, K. A., DEMORI, R., MERLO, E. M., GALLER, M., AND BERNSTEIN, M. 1996. Pattern matching for clone and concept detection. Journal of Automated Software Engineering 3, 1-2, 77-108.
KRINKE, J. 2001. Identifying similar code with program dependence graphs. In Proceedings of the 8th Working Conference on Reverse Engineering (WCRE '01), 2001. IEEE Computer Society Washington, DC, USA, 301-309.
KUITTINEN, M. AND SAJANIEMI, J. 2004. Teaching roles of variables in elementary programming courses. In Proceedings of the 9th annual SIGCSE conference on Innovation and technology in computer science education (Leeds, United Kingdom, 2004). 57-61.
LABEKE, N. V., POULOVASSILIS, A., AND MAGOULAS, G. 2008. Using similarity metrics for matching lifelong learners. In Proceedings of the 9th international conference on Intelligent Tutoring Systems (ITS '08). Springer-Verlag Berlin, Heidelberg, 142-151.
LIU, D., MARCUS, A., POSHYVANYK, D., AND RAJLICH, V. 2007. Feature location via information retrieval based filtering of a single scenario execution trace. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering (ASE '07),Atlanta, Georgia, 5-9 November, 2007. ACM New York, NY, USA, 234-243.
MALMI, L., KARAVIRTA, V., KORHONEN, A., NIKANDER, J., SEPPÄLÄ, O., AND SILVASTI, P. 2004. Visual algorithm simulation exercise system with automatic assessment: Trakla2. Informatics in Education 3, 2, 267-288.
MARCUS, A. AND MALETIC, J. I. 2001. Identification of high-level concept clones in source code. In Proceedings of the 16th IEEE International Conference on Automated Software Engineering, San Diego, California, 26- 29 November. IEEE, Washington, DC, USA, 107-114.
MARCUS, A., SERGEYEV, A., RAJLICH, V., AND MALETIC, J. I. 2004. An information retrieval approach to concept location in source code. In Proceedings of the 11th Working Conference on Reverse Engineering (WCRE '04), Delft, The Netherlands, 8-12 November, 2004. IEEE Computer Society Washington, DC, USA, 214-223.
MAYRAND, J., LEBLANC, C., AND MERLO, E. 1996. Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the 1996 International Conference on Software Maintenance (ICSM '96), November 1996. IEEE Computer Society Washington, DC, USA, 244-254.
MCCABE, T. J. 1976. A complexity measure. IEEE Transactions on Software Engineering SE-2, 308-320.
METZGER, R. AND WEN, Z. 2000. Automatic Algorithm Recognition and Replacement: A New Approach to Program Optimization. The MIT Press.
MIKSATKO, J. AND MCLAREN, B. M. 2008. What's in a cluster? automatically detecting interesting interactions in student e-discussions. In Proceedings of the 9th international conference on Intelligent Tutoring Systems (ITS '08). Springer-Verlag Berlin, Heidelberg, 333-342.
PENNINGTON, N. 1987. Comprehension strategies in programming. Empirical studies of programmers: second workshop, 100-113.
POSHYVANYK, D., GETHERS, M., AND MARCUS, A. 2012. Concept location using formal concept analysis and information retrieval. ACM Transactions on Software Engineering and Methodology (TOSEM) 21, 4.
POSHYVANYK, D. AND MARCUS, A. 2007a. Combining formal concept analysis with information retrieval for concept location in source code. In Proceedings of the 15th International Conference on Program Comprehension (ICPC '07), Banff, Alberta, Canada, 26-29 June, 2007. IEEE Computer Society Washington, DC, USA, 37-48.
POSHYVANYK, D. AND MARCUS, A. 2007b. Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Transactions on Software Engineering 33, 6, 420-432.
QUILICI, A. 1994. A memory-based approach to recognizing programming plans. Communications of the ACM 37, 5, 84-93.
QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, USA.
ROY, C. K., CORDY, J. R., AND KOSCHKE, R. 2009. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming 74, 7, 470-495.
SAJANIEMEI, J. AND PRIETO, R. N. 2005. Roles of variables in experts' programming knowledge. In Proceedings of the 17th Annual Workshop on the Psychology of Programming Interest Group (PPIG '05), University of Sussex, UK.
SAJANIEMI, J. 2002. An empirical analysis of roles of variables in novice-level procedural programs. In Proceedings of the IEEE 2002 Symposia on Human Centric Computing Languages and Environments, Arlington, Virginia, USA, 3-6 September. IEEE Computer Society Washington, DC, USA, 37-39.
SAJANIEMI, J., BEN-ARI, M., BYCKLING, P., GERDT, P., AND KULIKOVA, Y. 2006. Roles of variables in three programming paradigms. Computer Science Education 16, 4, 261-279.
SAJANIEMI, J. AND KUITTINEN, M. 2005. An experiment on using roles of variables in teaching introductory programming. Computer Science Education 15, 1, 59-82.
SEPPÄLÄ, O., MALMI, L., AND KORHONEN, A. 2006. Observations on student misconceptions - a case study of the build-heap algorithm. Computer Science Education 16, 3, 241-255.
SOLOWAY, E. AND EHRLICH, K. 1984. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering 10, 5, 595-609.
STOREY, M.-A. 2006. Theories, tools and research methods in program comprehension: past, present and future. Software Quality Journal 14, 3, 187-208.
TAHERKHANI, A. 2011a. Automatic algorithm recognition based on programming schemas. In Proceedings of the 23th Annual Workshop on the Psychology of Programming Interest Group (PPIG'11), University of York, UK, 6-8 September, 2011.
TAHERKHANI, A. 2011b. Using decision tree classifiers in source code analysis to recognize algorithms: An experiment with sorting algorithms. The Computer Journal 54, 11, 1845-1860.
TAHERKHANI, A. 2012. Schema detection and beacon-based classification for algorithm recognition. In Proceedings of the 24th Annual Workshop on the Psychology of Programming Interest Group (PPIG'12), London Metropolitan University, London, UK, 21-23 November, 2012.
TAHERKHANI, A., KORHONEN, A., AND MALMI, L. 2012a. Automatic recognition of students' sorting algorithm implementations in a data structures and algorithms course. In Proceedings of the 12th Koli Calling International Conference on Computing Education Research (Koli Calling 2012), Tahko, Finland, 15-18 November, 2012. ACM New York, NY, USA, 83-92.
TAHERKHANI, A., KORHONEN, A., AND MALMI, L. 2012b. Categorizing variations of student-implemented sorting algorithms. Computer Science Education 22, 2, 109-138.
TAN, P.-N., STEINBACH, M., AND KUMAR, V. 2006. Introduction to Data Mining. Addison-Wesley, USA.
TIARKS, R., KOSCHKE, R., AND FALKE, R. 2011. An extended assessment of type-3 clones as detected by state-of-the-art tools. Software Quality Control archive 19, 2, 295-331.
YANG, W. 1991. Identifying syntactic differences between two programs. Software-Practice and Experience 21, 7, 739-755.
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term “Work” shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attribution—other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercial—other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Works—other users (including Publisher) may not alter, transform, or build upon this Work,with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
- the Work is the Author’s original work;
- the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
- the Work is not pending review or under consideration by another publisher;
- the Work has not previously been published;
- the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
- the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.