In this paper, we present a method for recognizing algorithms from students programming submissions
coded in Java. The method is based on the concept of programming schemas and beacons. Schemas are highlevel programming knowledge with detailed knowledge abstracted out, and beacons are statements that
imply speci?c structures in a program. The method automatically searches for schemas from the given program and compares the extracted schemas with those from the knowledge base to recognize the algorithmspeci?c code for further processing. In the next step, several characteristics and beacons speci?c to the given algorithm are computed from the code. These characteristics and beacons are then used as the learning data and given to the C4.5 algorithm, which builds a classi?cation tree that can be used to classify previously unseen implementations of algorithms.
The method and its performance is demonstrated in the case of basic sorting algorithms and their variations implemented both in various learning resources, that is, textbooks and websites, (N = 209 programs) and in genuine student submissions in a ?rst year data structures and algorithms course (N = 159). The empirical study conducted for evaluating the performance of the classi?cation by leave-one-out cross-validation technique shows that the estimated true positive rate is 97.0%. The results demonstrate the feasibility of the idea of recognizing algorithms based on algorithm-speci?c programming schemas and beacons. The method can be used as white-box analysis to verify that students have implemented the required algorithm and to give feedback to students on their problematic implementations. We discuss the applications of the method from both teachers and students point of view.
How to Cite
AMERSHI , S. AND CONATI , C. 2009. Combining unsupervised and supervised classification to build user models for exploratory learning environments. Journal of Educational Data Mining 1, 1, 18–71.
BAKER , B. S. 1995. On finding duplication and near-duplication in large software systems. In Proceedings
of the Second Working Conference on Reverse Engineering (WCRE ’95), 1995. IEEE Computer Society
Washington, DC, USA, 86–95.
BASIT, H. A., PUGLISI , S. J., M C MASTER , W. F. S., TURPIN, A., AND JARZABEK , S. 2007. Efficient token
based clone detection with flexible tokenization. In Proceedings of the 6th European Software Engineering
Conference and Foundations of Software Engineering, ESEC/FSE 2007. ACM New York, NY, USA, 513–
BAXTER , I. D., YAHIN, A., MOURA , L., SANT ’ANNA , M., AND BIER , L. 1998. Clone detection using ab-
stract syntax trees. In Proceedings of the 14th IEEE International Conference on Software Maintenance,
Bethesda, Maryland, USA, 16–19 March, 1998. IEEE Computer Society Washington, DC, USA, 368–377.
BELLON, S., KOSCHKE , R., ANTONIOL , G., KRINKE , J., AND MERLO, E. 2007. Comparison and evaluation of
clone detection tools. IEEE Transactions on Software Engineering 33, 9, 577–591.
BEN -ARI , M. AND SAJANIEMI , J. 2004. Roles of variables as seen by CS educators. SIGCSE Bulletin 36, 3,
BISHOP, C. AND JOHNSON, C. G. 2005. Assessing roles of variables by program analysis. In Proceedings of the
5th Baltic Sea Conference on Computing Education Research, Koli, Finland, 17–20 November. University
of Joensuu, Finland, 131–136.
BROOKS, R. 1983. Towards a theory of the comprehension of computer programs. International Journal of
Man-Machine Studies 18, 6, 543–554.
BYCKLING, P. AND SAJANIEMI , J. 2006. Roles of variables and programming skills improvement. SIGCSE
Bulletin 38, 1, 413–417.
DETIENNE , F. 1990. Expert programming knowledge: A schema-based approach. In Psychology of Programming, J.-M. Hoc, T. R. G. Green, R. Samurcay, and D. J. Gilmore, Eds. Academic Press, London, 205–222.
DIT, B., REVELLE , M., GETHERS, M., AND POSHYVANYK , D. 2013. Feature location in source code: A taxon-
omy and survey. Journal of Software: Evolution and Process (JSEP) 25, 1, 53–95.
DOMINGUEZ , A. K., YACEF, K., AND CURRAN, J. 2010. Data mining to generate individualised feedback. In
Proceedings of the 10th international conference on Intelligent Tutoring Systems (ITS ’10) - Volume Part II.
Springer-Verlag Berlin, Heidelberg, 303–305.
DRUMMOND, J. AND LITMAN, D. J. 2010. In the zone: Towards detecting student zoning out using supervised
machine learning. In Proceedings of the 10th international conference on Intelligent Tutoring Systems (ITS
’10) - Volume Part II. Springer-Verlag Berlin, Heidelberg, 306–308.
DUCASSE , S., RIEGER , M., AND DEMEYER , S. 1999. A language independent approach for detecting duplicated
code. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM ’99), 1999.
IEEE Computer Society Washington, DC, USA, 109–118.
EDWARDS, D., SIMMONS, S., AND WILDE , N. 2006. An approach to feature location in distributed systems.
Journal of Systems and Software 79, 1, 57–68.
EDWARDS, S. H. 2003. Rethinking computer science education from a test-first perspective. In Companion
of the 18th annual ACMSIGPLAN conference on Object-oriented programming, systems, languages, and
applications, Anaheim, California, USA, 26–30 October. ACM, New York, NY, USA, 148–155.
EISENBERG, A. D. AND VOLDER , K. D. 2005. Dynamic feature traces: Finding features in unfamiliar code. In
Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM ’05), 2005. EEE
Computer Society Washington, DC, USA, 337–346.
GAY, G., HAIDUC, S., MARCUS, A., AND MENZIES, T. 2009. On the use of relevance feedback in ir-based con-
cept location. In Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM
’09), Edmonton, Canada, 20–26 September, 2009. EEE Computer Society Washington, DC, USA, 351–360.
GERDT, P. AND SAJANIEMI , J. 2004. An approach to automatic detection of variable roles in program animation. In Proceedings of the 3th Program Visualization Workshop, the University of Warwick, UK, 1–2 July.
The University of Warwick, UK, 86–93.
GERDT, P. AND SAJANIEMI , J. 2006. A web-based service for the automatic detection of roles of variables.
In Proceedings of the 11th annual SIGCSE conference on Innovation and technology in computer science
education, Bologna, Italy, 26–28 June. ACM, New York, NY, USA, 178–182.
HALSTEAD, M. 1977. Elements of Software Science. Elsevier Science Inc, New York, NY, USA.
HARANDI , M. AND NING, J. 1990. Knowledge-based program analysis. Software IEEE 7, 4, 74–81.
HIGGINS, C., SYMEONIDIS, P., AND TSINTSIFAS, A. 2002. The marking system for CourseMaster. In Proceed-
ings of the 7th annual conference on Innovation and Technology in Computer Science Education, Aarhus,
Denmark, 24–26 June. ACM, New York, NY, USA, 46–50.
IHANTOLA , P., KARAVIRTA , V., SEPP AL A , O., AND AHONIEMI , T. 2010. Review of recent systems for automatic
assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference
on Computing Education Research (Koli Calling 2010).
JOHNSON, W. L. AND SOLOWAY, E. 1984. Proust: Knowledge-based program understanding. In Proceedings
of the 7th international conference on Software engineering, Orlando, Florida, USA, 26–29 March. IEEE
Press Piscataway, NJ, USA, 369–380.
JOY, M., GRIFFITHS, N., AND BOYATT, R. 2005. The BOSS online submission and assessment system. ACM
Journal on Educational Resources in Computing 5, 3, 1–28.
KOMONDOOR , R. AND HORWITZ , S. 2001. Using slicing to identify duplication in source code. In Proceedings of
the 8th International Symposium on Static Analysis, SAS 2001, Paris, France, 16–18 July, 2001. Springer,
KONTOGIANNIS, K. A., D E MORI , R., MERLO, E. M., GALLER , M., AND BERNSTEIN, M. 1996. Pattern matching for clone and concept detection. Journal of Automated Software Engineering 3, 1–2, 77–108.
KRINKE , J. 2001. Identifying similar code with program dependence graphs. In Proceedings of the 8th Working
Conference on Reverse Engineering (WCRE ’01), 2001. IEEE Computer Society Washington, DC, USA,
KUITTINEN, M. AND SAJANIEMI , J. 2004. Teaching roles of variables in elementary programming courses.
In Proceedings of the 9th annual SIGCSE conference on Innovation and technology in computer science
education (Leeds, United Kingdom, 2004). 57–61.
LABEKE , N. V., POULOVASSILIS, A., AND MAGOULAS, G. 2008. Using similarity metrics for matching life-
long learners. In Proceedings of the 9th international conference on Intelligent Tutoring Systems (ITS ’08).
Springer-Verlag Berlin, Heidelberg, 142–151.
LIU, D., MARCUS, A., POSHYVANYK , D., AND RAJLICH , V. 2007. Feature location via information retrieval
based filtering of a single scenario execution trace. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering (ASE ’07),Atlanta, Georgia, 5–9 November, 2007.
ACM New York, NY, USA, 234–243.
MALMI , L., KARAVIRTA , V., KORHONEN, A., NIKANDER , J., SEPP AL A , O., AND SILVASTI , P. 2004. Visual
algorithm simulation exercise system with automatic assessment: Trakla2. Informatics in Education 3, 2,
MARCUS, A. AND MALETIC, J. I. 2001. Identification of high-level concept clones in source code. In Proceedings
of the 16th IEEE International Conference on Automated Software Engineering, San Diego, California, 26–
29 November. IEEE, Washington, DC, USA, 107–114.
MARCUS, A., SERGEYEV, A., RAJLICH , V., AND MALETIC, J. I. 2004. An information retrieval approach to
concept location in source code. In Proceedings of the 11th Working Conference on Reverse Engineering
(WCRE ’04), Delft, The Netherlands, 8–12 November, 2004. EEE Computer Society Washington, DC, USA,
MAYRAND, J., LEBLANC, C., AND MERLO, E. 1996. Experiment on the automatic detection of function clones
in a software system using metrics. In Proceedings of the 1996 International Conference on Software Maintenance (ICSM ’96), November 1996. IEEE Computer Society Washington, DC, USA, 244–254.
MC CABE , T. J. 1976. A complexity measure. IEEE Transactions on Software Engineering SE-2, 308–320.
METZGER , R. AND WEN, Z. 2000. Automatic Algorithm Recognition and Replacement: A New Approach to
Program Optimization. The MIT Press.
MIKSATKO, J. AND MCLAREN, B. M. 2008. What’s in a cluster? automatically detecting interesting interac-
tions in student e-discussions. In Proceedings of the 9th international conference on Intelligent Tutoring
Systems (ITS ’08). Springer-Verlag Berlin, Heidelberg, 333–342.
PENNINGTON, N. 1987. Comprehension strategies in programming. Empirical studies of programmers: second
POSHYVANYK , D., GETHERS, M., AND MARCUS, A. 2012. Concept location using formal concept analysis and
information retrieval. ACM Transactions on Software Engineering and Methodology (TOSEM) 21, 4.
POSHYVANYK , D. AND MARCUS, A. 2007a. Combining formal concept analysis with information retrieval for
concept location in source code. In Proceedings of the 15th International Conference on Program Comprehension (ICPC ’07), Banff, Alberta, Canada, 26–29 June, 2007. EEE Computer Society Washington, DC,
POSHYVANYK , D. AND MARCUS, A. 2007b. Feature location using probabilistic ranking of methods based on
execution scenarios and information retrieval. IEEE Transactions on Software Engineering 33, 6, 420–432.
QUILICI , A. 1994. A memory-based approach to recognizing programming plans. Communications of the
ACM 37, 5, 84–93.
QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, USA.
ROY, C. K., CORDY, J. R., AND KOSCHKE , R. 2009. Comparison and evaluation of code clone detection tech-
niques and tools: A qualitative approach. Science of Computer Programming 74, 7, 470–495.
SAJANIEMEI , J. AND PRIETO, R. N. 2005. Roles of variables in experts’ programming knowledge. In Proceedings of the 17th Annual Workshop on the Psychology of Programming Interest Group (PPIG ’05), University
of Sussex, UK.
SAJANIEMI , J. 2002. An empirical analysis of roles of variables in novice-level procedural programs. In Proceedings of the IEEE 2002 Symposia on Human Centric Computing Languages and Environments, Arling-
ton, Virginia, USA, 3–6 September. IEEE Computer Society Washington, DC, USA, 37–39.
SAJANIEMI , J., BEN -ARI , M., BYCKLING, P., GERDT, P., AND KULIKOVA , Y. 2006. Roles of variables in three
programming paradigms. Computer Science Education 16, 4, 261–279.
SAJANIEMI , J. AND KUITTINEN, M. 2005. An experiment on using roles of variables in teaching introductory
programming. Computer Science Education 15, 1, 59–82.
SEPPAL A , O., MALMI , L., AND KORHONEN, A. 2006. Observations on student misconceptions – a case study
of the build-heap algorithm. Computer Science Education 16, 3, 241–255.
SOLOWAY, E. AND EHRLICH , K. 1984. Empirical studies of programming knowledge. IEEE Transactions on
Software Engineering 10, 5, 595–609.
STOREY, M.-A. 2006. Theories, tools and research methods in program comprehension: past, present and
future. Software Quality Journal 14, 3, 187–208.
TAHERKHANI , A. 2011a. Automatic algorithm recognition based on programming schemas. In Proceedings of
the 23th Annual Workshop on the Psychology of Programming Interest Group (PPIG’11), University of York,
UK, 6-8 September, 2011.
TAHERKHANI , A. 2011b. Using decision tree classifiers in source code analysis to recognize algorithms: An
experiment with sorting algorithms. The Computer Journal 54, 11, 1845–1860.
TAHERKHANI , A. 2012. Schema detection and beacon-based classification for algorithm recognition. In Proceedings of the 24th Annual Workshop on the Psychology of Programming Interest Group (PPIG’12), London
Metropolitan University, London, UK, 21-23 November, 2012.
TAHERKHANI , A., KORHONEN, A., AND MALMI , L. 2012a. Automatic recognition of students’ sorting algo-
rithm implementations in a data structures and algorithms course. In Proceedings of the 12th Koli Calling
International Conference on Computing Education Research (Koli Calling 2012), Tahko, Finland, 15–18
November, 2012. ACM New York, NY, USA, 83–92.
TAHERKHANI , A., KORHONEN, A., AND MALMI , L. 2012b. Categorizing variations of student-implemented
sorting algorithms. Computer Science Education 22, 2, 109–138.
TAN, P.-N., STEINBACH , M., AND KUMAR , V. 2006. Introduction to Data Mining. Addison-Wesley, USA.
TIARKS, R., KOSCHKE , R., ANDFALKE , R. 2011. An extended assessment of type-3 clones as detected by
state-of-the-art tools. Software Quality Control archive 19, 2, 295–331.
YANG, W. 1991. Identifying syntactic differences between two programs. Software–Practice and Experience 21, 7, 739–755.
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term â€śWorkâ€ť shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attributionâ€”other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercialâ€”other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Worksâ€”other users (including Publisher) may not alter, transform, or build upon this Work,with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisherâ€™s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisherâ€™s request, the Author agrees to furnish promptly to Publisher, at the Authorâ€™s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
- the Work is the Authorâ€™s original work;
- the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
- the Work is not pending review or under consideration by another publisher;
- the Work has not previously been published;
- the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
- the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Authorâ€™s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisherâ€™s use and publication of any content contained in the Work, including third-party content.