Beacon- and Schema-Based Method for Recognizing Algorithms from Students’ Source Code

##plugins.themes.bootstrap3.article.main##

##plugins.themes.bootstrap3.article.sidebar##

Published Sep 1, 2013
Ahmad Taherkhani Lauri Malmi

Abstract

In this paper, we present a method for recognizing algorithms from students programming submissions
coded in Java. The method is based on the concept of programming schemas and beacons. Schemas are highlevel programming knowledge with detailed knowledge abstracted out, and beacons are statements that
imply speci?c structures in a program. The method automatically searches for schemas from the given program and compares the extracted schemas with those from the knowledge base to recognize the algorithmspeci?c code for further processing. In the next step, several characteristics and beacons speci?c to the given algorithm are computed from the code. These characteristics and beacons are then used as the learning data and given to the C4.5 algorithm, which builds a classi?cation tree that can be used to classify previously unseen implementations of algorithms.

The method and its performance is demonstrated in the case of basic sorting algorithms and their variations implemented both in various learning resources, that is, textbooks and websites, (N = 209 programs) and in genuine student submissions in a ?rst year data structures and algorithms course (N = 159). The empirical study conducted for evaluating the performance of the classi?cation by leave-one-out cross-validation technique shows that the estimated true positive rate is 97.0%. The results demonstrate the feasibility of the idea of recognizing algorithms based on algorithm-speci?c programming schemas and beacons. The method can be used as white-box analysis to verify that students have implemented the required algorithm and to give feedback to students on their problematic implementations. We discuss the applications of the method from both teachers and students point of view.

How to Cite

Taherkhani, A., & Malmi, L. (2013). JEDM | Journal of Educational Data Mining, 5(2), 69-101. Retrieved from https://jedm.educationaldatamining.org/index.php/JEDM/article/view/5
Abstract 405 | PDF Downloads 105

##plugins.themes.bootstrap3.article.details##

References
ALA-MUTKA , K. 2005. A survey of automated assessment approaches for programming assignments. Computer Science Education 15, 2, 83–102.

AMERSHI , S. AND CONATI , C. 2009. Combining unsupervised and supervised classification to build user models for exploratory learning environments. Journal of Educational Data Mining 1, 1, 18–71.

BAKER , B. S. 1995. On finding duplication and near-duplication in large software systems. In Proceedings
of the Second Working Conference on Reverse Engineering (WCRE ’95), 1995. IEEE Computer Society
Washington, DC, USA, 86–95.

BASIT, H. A., PUGLISI , S. J., M C MASTER , W. F. S., TURPIN, A., AND JARZABEK , S. 2007. Efficient token
based clone detection with flexible tokenization. In Proceedings of the 6th European Software Engineering
Conference and Foundations of Software Engineering, ESEC/FSE 2007. ACM New York, NY, USA, 513–
516.

BAXTER , I. D., YAHIN, A., MOURA , L., SANT ’ANNA , M., AND BIER , L. 1998. Clone detection using ab-
stract syntax trees. In Proceedings of the 14th IEEE International Conference on Software Maintenance,
Bethesda, Maryland, USA, 16–19 March, 1998. IEEE Computer Society Washington, DC, USA, 368–377.

BELLON, S., KOSCHKE , R., ANTONIOL , G., KRINKE , J., AND MERLO, E. 2007. Comparison and evaluation of
clone detection tools. IEEE Transactions on Software Engineering 33, 9, 577–591.

BEN -ARI , M. AND SAJANIEMI , J. 2004. Roles of variables as seen by CS educators. SIGCSE Bulletin 36, 3,
52–56.

BISHOP, C. AND JOHNSON, C. G. 2005. Assessing roles of variables by program analysis. In Proceedings of the
5th Baltic Sea Conference on Computing Education Research, Koli, Finland, 17–20 November. University
of Joensuu, Finland, 131–136.

BROOKS, R. 1983. Towards a theory of the comprehension of computer programs. International Journal of
Man-Machine Studies 18, 6, 543–554.

BYCKLING, P. AND SAJANIEMI , J. 2006. Roles of variables and programming skills improvement. SIGCSE
Bulletin 38, 1, 413–417.
?

DETIENNE , F. 1990. Expert programming knowledge: A schema-based approach. In Psychology of Programming, J.-M. Hoc, T. R. G. Green, R. Samurcay, and D. J. Gilmore, Eds. Academic Press, London, 205–222.

DIT, B., REVELLE , M., GETHERS, M., AND POSHYVANYK , D. 2013. Feature location in source code: A taxon-
omy and survey. Journal of Software: Evolution and Process (JSEP) 25, 1, 53–95.

DOMINGUEZ , A. K., YACEF, K., AND CURRAN, J. 2010. Data mining to generate individualised feedback. In
Proceedings of the 10th international conference on Intelligent Tutoring Systems (ITS ’10) - Volume Part II.
Springer-Verlag Berlin, Heidelberg, 303–305.

DRUMMOND, J. AND LITMAN, D. J. 2010. In the zone: Towards detecting student zoning out using supervised
machine learning. In Proceedings of the 10th international conference on Intelligent Tutoring Systems (ITS
’10) - Volume Part II. Springer-Verlag Berlin, Heidelberg, 306–308.

DUCASSE , S., RIEGER , M., AND DEMEYER , S. 1999. A language independent approach for detecting duplicated
code. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM ’99), 1999.
IEEE Computer Society Washington, DC, USA, 109–118.

EDWARDS, D., SIMMONS, S., AND WILDE , N. 2006. An approach to feature location in distributed systems.
Journal of Systems and Software 79, 1, 57–68.

EDWARDS, S. H. 2003. Rethinking computer science education from a test-first perspective. In Companion
of the 18th annual ACMSIGPLAN conference on Object-oriented programming, systems, languages, and
applications, Anaheim, California, USA, 26–30 October. ACM, New York, NY, USA, 148–155.

EISENBERG, A. D. AND VOLDER , K. D. 2005. Dynamic feature traces: Finding features in unfamiliar code. In
Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM ’05), 2005. EEE
Computer Society Washington, DC, USA, 337–346.

GAY, G., HAIDUC, S., MARCUS, A., AND MENZIES, T. 2009. On the use of relevance feedback in ir-based con-
cept location. In Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM
’09), Edmonton, Canada, 20–26 September, 2009. EEE Computer Society Washington, DC, USA, 351–360.

GERDT, P. AND SAJANIEMI , J. 2004. An approach to automatic detection of variable roles in program animation. In Proceedings of the 3th Program Visualization Workshop, the University of Warwick, UK, 1–2 July.
The University of Warwick, UK, 86–93.

GERDT, P. AND SAJANIEMI , J. 2006. A web-based service for the automatic detection of roles of variables.
In Proceedings of the 11th annual SIGCSE conference on Innovation and technology in computer science
education, Bologna, Italy, 26–28 June. ACM, New York, NY, USA, 178–182.

HALSTEAD, M. 1977. Elements of Software Science. Elsevier Science Inc, New York, NY, USA.

HARANDI , M. AND NING, J. 1990. Knowledge-based program analysis. Software IEEE 7, 4, 74–81.

HIGGINS, C., SYMEONIDIS, P., AND TSINTSIFAS, A. 2002. The marking system for CourseMaster. In Proceed-
ings of the 7th annual conference on Innovation and Technology in Computer Science Education, Aarhus,
Denmark, 24–26 June. ACM, New York, NY, USA, 46–50.

IHANTOLA , P., KARAVIRTA , V., SEPP AL A , O., AND AHONIEMI , T. 2010. Review of recent systems for automatic
assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference
on Computing Education Research (Koli Calling 2010).

JOHNSON, W. L. AND SOLOWAY, E. 1984. Proust: Knowledge-based program understanding. In Proceedings
of the 7th international conference on Software engineering, Orlando, Florida, USA, 26–29 March. IEEE
Press Piscataway, NJ, USA, 369–380.

JOY, M., GRIFFITHS, N., AND BOYATT, R. 2005. The BOSS online submission and assessment system. ACM
Journal on Educational Resources in Computing 5, 3, 1–28.

KOMONDOOR , R. AND HORWITZ , S. 2001. Using slicing to identify duplication in source code. In Proceedings of
the 8th International Symposium on Static Analysis, SAS 2001, Paris, France, 16–18 July, 2001. Springer,
40–56.

KONTOGIANNIS, K. A., D E MORI , R., MERLO, E. M., GALLER , M., AND BERNSTEIN, M. 1996. Pattern matching for clone and concept detection. Journal of Automated Software Engineering 3, 1–2, 77–108.

KRINKE , J. 2001. Identifying similar code with program dependence graphs. In Proceedings of the 8th Working
Conference on Reverse Engineering (WCRE ’01), 2001. IEEE Computer Society Washington, DC, USA,
301–309.

KUITTINEN, M. AND SAJANIEMI , J. 2004. Teaching roles of variables in elementary programming courses.
In Proceedings of the 9th annual SIGCSE conference on Innovation and technology in computer science
education (Leeds, United Kingdom, 2004). 57–61.

LABEKE , N. V., POULOVASSILIS, A., AND MAGOULAS, G. 2008. Using similarity metrics for matching life-
long learners. In Proceedings of the 9th international conference on Intelligent Tutoring Systems (ITS ’08).
Springer-Verlag Berlin, Heidelberg, 142–151.

LIU, D., MARCUS, A., POSHYVANYK , D., AND RAJLICH , V. 2007. Feature location via information retrieval
based filtering of a single scenario execution trace. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering (ASE ’07),Atlanta, Georgia, 5–9 November, 2007.
ACM New York, NY, USA, 234–243.
? ?

MALMI , L., KARAVIRTA , V., KORHONEN, A., NIKANDER , J., SEPP AL A , O., AND SILVASTI , P. 2004. Visual
algorithm simulation exercise system with automatic assessment: Trakla2. Informatics in Education 3, 2,
267–288.

MARCUS, A. AND MALETIC, J. I. 2001. Identification of high-level concept clones in source code. In Proceedings
of the 16th IEEE International Conference on Automated Software Engineering, San Diego, California, 26–
29 November. IEEE, Washington, DC, USA, 107–114.

MARCUS, A., SERGEYEV, A., RAJLICH , V., AND MALETIC, J. I. 2004. An information retrieval approach to
concept location in source code. In Proceedings of the 11th Working Conference on Reverse Engineering
(WCRE ’04), Delft, The Netherlands, 8–12 November, 2004. EEE Computer Society Washington, DC, USA,
214–223.

MAYRAND, J., LEBLANC, C., AND MERLO, E. 1996. Experiment on the automatic detection of function clones
in a software system using metrics. In Proceedings of the 1996 International Conference on Software Maintenance (ICSM ’96), November 1996. IEEE Computer Society Washington, DC, USA, 244–254.

MC CABE , T. J. 1976. A complexity measure. IEEE Transactions on Software Engineering SE-2, 308–320.

METZGER , R. AND WEN, Z. 2000. Automatic Algorithm Recognition and Replacement: A New Approach to
Program Optimization. The MIT Press.

MIKSATKO, J. AND MCLAREN, B. M. 2008. What’s in a cluster? automatically detecting interesting interac-
tions in student e-discussions. In Proceedings of the 9th international conference on Intelligent Tutoring
Systems (ITS ’08). Springer-Verlag Berlin, Heidelberg, 333–342.

PENNINGTON, N. 1987. Comprehension strategies in programming. Empirical studies of programmers: second
workshop, 100–113.

POSHYVANYK , D., GETHERS, M., AND MARCUS, A. 2012. Concept location using formal concept analysis and
information retrieval. ACM Transactions on Software Engineering and Methodology (TOSEM) 21, 4.

POSHYVANYK , D. AND MARCUS, A. 2007a. Combining formal concept analysis with information retrieval for
concept location in source code. In Proceedings of the 15th International Conference on Program Comprehension (ICPC ’07), Banff, Alberta, Canada, 26–29 June, 2007. EEE Computer Society Washington, DC,
USA, 37–48.

POSHYVANYK , D. AND MARCUS, A. 2007b. Feature location using probabilistic ranking of methods based on
execution scenarios and information retrieval. IEEE Transactions on Software Engineering 33, 6, 420–432.

QUILICI , A. 1994. A memory-based approach to recognizing programming plans. Communications of the
ACM 37, 5, 84–93.

QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, USA.

ROY, C. K., CORDY, J. R., AND KOSCHKE , R. 2009. Comparison and evaluation of code clone detection tech-
niques and tools: A qualitative approach. Science of Computer Programming 74, 7, 470–495.

SAJANIEMEI , J. AND PRIETO, R. N. 2005. Roles of variables in experts’ programming knowledge. In Proceedings of the 17th Annual Workshop on the Psychology of Programming Interest Group (PPIG ’05), University
of Sussex, UK.

SAJANIEMI , J. 2002. An empirical analysis of roles of variables in novice-level procedural programs. In Proceedings of the IEEE 2002 Symposia on Human Centric Computing Languages and Environments, Arling-
ton, Virginia, USA, 3–6 September. IEEE Computer Society Washington, DC, USA, 37–39.

SAJANIEMI , J., BEN -ARI , M., BYCKLING, P., GERDT, P., AND KULIKOVA , Y. 2006. Roles of variables in three
programming paradigms. Computer Science Education 16, 4, 261–279.

SAJANIEMI , J. AND KUITTINEN, M. 2005. An experiment on using roles of variables in teaching introductory
programming. Computer Science Education 15, 1, 59–82.
? ?

SEPPAL A , O., MALMI , L., AND KORHONEN, A. 2006. Observations on student misconceptions – a case study
of the build-heap algorithm. Computer Science Education 16, 3, 241–255.

SOLOWAY, E. AND EHRLICH , K. 1984. Empirical studies of programming knowledge. IEEE Transactions on
Software Engineering 10, 5, 595–609.

STOREY, M.-A. 2006. Theories, tools and research methods in program comprehension: past, present and
future. Software Quality Journal 14, 3, 187–208.

TAHERKHANI , A. 2011a. Automatic algorithm recognition based on programming schemas. In Proceedings of
the 23th Annual Workshop on the Psychology of Programming Interest Group (PPIG’11), University of York,
UK, 6-8 September, 2011.

TAHERKHANI , A. 2011b. Using decision tree classifiers in source code analysis to recognize algorithms: An
experiment with sorting algorithms. The Computer Journal 54, 11, 1845–1860.

TAHERKHANI , A. 2012. Schema detection and beacon-based classification for algorithm recognition. In Proceedings of the 24th Annual Workshop on the Psychology of Programming Interest Group (PPIG’12), London
Metropolitan University, London, UK, 21-23 November, 2012.

TAHERKHANI , A., KORHONEN, A., AND MALMI , L. 2012a. Automatic recognition of students’ sorting algo-
rithm implementations in a data structures and algorithms course. In Proceedings of the 12th Koli Calling
International Conference on Computing Education Research (Koli Calling 2012), Tahko, Finland, 15–18
November, 2012. ACM New York, NY, USA, 83–92.

TAHERKHANI , A., KORHONEN, A., AND MALMI , L. 2012b. Categorizing variations of student-implemented
sorting algorithms. Computer Science Education 22, 2, 109–138.

TAN, P.-N., STEINBACH , M., AND KUMAR , V. 2006. Introduction to Data Mining. Addison-Wesley, USA.

TIARKS, R., KOSCHKE , R., ANDFALKE , R. 2011. An extended assessment of type-3 clones as detected by
state-of-the-art tools. Software Quality Control archive 19, 2, 295–331.

YANG, W. 1991. Identifying syntactic differences between two programs. Software–Practice and Experience 21, 7, 739–755.
Section
Articles