WordBytes: Exploring an Intermediate Constraint Format for Rapid Classification of Student Answers on Constructed Response Assessments



Published Dec 23, 2017
Kerry J Kim, Denise S Pope, Daniel Wendel, Eli Meir


Computerized classification of student answers offers the possibility of instant feedback and improved learning. However, different question types trade off formative value against ease of classification. Open response (OR) questions provide greater insight into student thinking and understanding than more constrained multiple choice (MC) questions, but automated classifiers for them are harder to develop, often requiring a machine learning system trained on many human-classified answers. Here we explore a novel intermediate-constraint question format called WordBytes (WB), in which students assemble one-sentence answers to two different college evolutionary biology questions by choosing, then ordering, fixed tiles containing words and phrases. We found that WB allowed students to construct hundreds to thousands of different answers, with multiple ways to express correct answers as well as incorrect answers reflecting different misconceptions. WB offers the possibility of more rapid development of classifiers: we found that humans could specify rules for an automated grader that accurately classified answers as correct/incorrect with Cohen’s kappa of 0.89 or higher, near the measured intra-rater reliability of two human graders and the performance of machine classification of OR answers (Nehm et al. 2012). Finer-grained classification to identify specific misconceptions had much lower accuracy (Cohen’s kappa < 0.70). This could be improved by using either a machine learner or human rules, but both approaches required inspecting and classifying many student answers. We thus find that the intermediate constraints of our WB format allow accurate grading of correctness without the labor-intensive step of collecting hundreds of student answers.
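As a rough illustration of the evaluation described above, the following minimal sketch scores tile-based answers with a hand-written rule and compares the result to human labels using Cohen's kappa (Cohen 1960). The tile phrases, the rule, and the labels are hypothetical examples invented for this sketch; they are not the tiles, rules, or data used in the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def rule_grader(tiles):
    """Hypothetical rule: an answer is correct if it mentions variation and
    differential survival, and does not use need-based reasoning."""
    has_variation = "individuals vary" in tiles
    has_selection = "survive and reproduce more" in tiles
    has_need = "because they need to" in tiles
    if has_variation and has_selection and not has_need:
        return "correct"
    return "incorrect"


# Hypothetical answers, each an ordered list of chosen tiles.
answers = [
    ["individuals vary", "survive and reproduce more"],
    ["snails change", "because they need to"],
    ["individuals vary", "survive and reproduce more", "because they need to"],
    ["survive and reproduce more"],
]
# Hypothetical human labels for the same four answers.
human = ["correct", "incorrect", "incorrect", "correct"]

machine = [rule_grader(a) for a in answers]
kappa = cohens_kappa(human, machine)
print(machine, kappa)
```

Because each answer is a sequence drawn from a fixed set of tiles, a rule can test for the presence of specific tiles directly, with no text normalization or spelling correction. A real rule set would likely also need to consider tile order.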

How to Cite

Kim, K. J., Pope, D. S., Wendel, D., & Meir, E. (2017). WordBytes: Exploring an Intermediate Constraint Format for Rapid Classification of Student Answers on Constructed Response Assessments. JEDM | Journal of Educational Data Mining, 9(2), 45-71. Retrieved from https://jedm.educationaldatamining.org/index.php/JEDM/article/view/209
References


AMERICAN ASSOCIATION FOR THE ADVANCEMENT OF SCIENCE (AAAS) (2011) Vision and change in undergraduate biology education. AAAS, Washington, DC.

BEGGROW E. P., HA M., NEHM R. H., PEARL D., BOONE W. J. (2014) Assessing scientific practices using machine-learning methods: How closely do they match clinical interview performance? Journal of Science Education and Technology, 23:160-182.

BEJAR I. I. (1991) A methodology for scoring open-ended architectural design problems. Journal of Applied Psychology, 76(4):522-532.

BENNETT R. E. (1993) On the meaning of constructed response. In Bennett R. E., Ward W. C. (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Lawrence Erlbaum Associates. Hillsdale NJ. 1-27.

BLACK P., WILIAM D. (1998) Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1):7-74.

CHANG C. C., LIN C. J. (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27.

COHEN J. (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.

HA M., NEHM R. H., URBAN-LURAIN M., MERRILL J. E. (2011) Applying Computerized Scoring Models of Written Biological Explanations across Courses and Colleges: Prospects and Limitations. CBE Life Sciences Education, 10:379.

HA M., NEHM R. H. (2016) The impact of misspelled words on automated computer scoring: a case study of scientific explanations. Journal of Science Education and Technology, 25(3):358.

HERRON J., ABRAHAM J., MEIR E. (2014) Mendelian Pigs. Simbio.com.

HERRON J., MEIR E. (2014) Darwinian Snails. Simbio.com.

HOFMANN M., KLINKENBERG R. (eds) (2013) RapidMiner: Data mining use cases and business analytics applications (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series), CRC Press.

HSU C. W., CHANG C. C., LIN C. J. (2003) A practical guide to support vector classification. https://www.cs.sfu.ca/people/Faculty/teaching/726/spring11/svmguide.pdf

KLEIN S. P. (2008) Characteristics of hand and machine-assigned scores to college students’ answers to open-ended tasks. In Nolan D., Speed T. (Eds.), Probability and statistics: Essays in Honor of David A. Freedman. Beachwood, OH. 76-89.

KRIPPENDORFF K. (1980) Content analysis: An introduction to its methodology. Sage Publications.

LANDIS J. R., KOCH G. G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33:159-174.

LEELAWONG K., BISWAS G. (2008) Designing learning by teaching agents: The Betty’s Brain system. International Journal of Artificial Intelligence in Education, 18(3):181-208.

LUKHOFF B. (2010) The design and validation of an automatically-scored constructed-response item type for measuring graphical representation skill. Doctoral dissertation, Stanford University, Stanford, CA.

LUCKIE D. B., HARRISON S. H., WALLACE J. L., EBERT-MAY D. (2008) Studying C-TOOLS: Automated grading for online concept maps. Conference Proceedings from Conceptual Assessment in Biology II, 2(1):1-13.

MOHARRERI K., HA M., NEHM R. H. (2014) EvoGrader: an online formative assessment tool for automatically evaluating written evolutionary explanations. Evolution: Education and Outreach, 7:15.

NATIONAL RESEARCH COUNCIL (2001) Knowing what students know: The science and design of educational assessment. Washington, DC: National Academies Press.

NEHM R. H., HA M., MAYFIELD E. (2012) Transforming Biology Assessment with Machine Learning: Automated Scoring of Written Evolutionary Explanations. Journal of Science Education and Technology, 21:183.

NEHM R. H., HAERTIG H. (2012) Human vs. computer diagnosis of students’ natural selection knowledge: testing the efficacy of text analytic software. Journal of Science Education and Technology, 21(1):56-73.

QUINLAN R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

RITTHOFF O., KLINKENBERG R., MIERSWA I., FELSKE S. (2001) YALE: Yet Another Learning Environment. LLWA’01-Tagungsband der GI-Workshop-Woche Lernen-Lehren-Wissen-Adaptivität. University of Dortmund, Dortmund, Germany. Technical Report, 763:84-92.

ROMERO C., VENTURA S., PECHENIZKIY M., BAKER R. S. (2010) Handbook of Educational Data Mining. CRC Press.

SCALISE K., GIFFORD B. (2006) Computer based assessment in E-Learning: A framework for constructing “intermediate constraint” questions and tasks for technology platforms. Journal of Technology, Learning, and Assessment, 4(6):4-44.

SHUTE V. J. (2008) Focus on Formative Feedback. Review of Educational Research, 78(1):153-189.

SMITH M. K., WOOD W. B., KNIGHT J. K. (2008) The genetics concept assessment: A new concept inventory for gauging student understanding of genetics. CBE Life Sciences Education, 7(4):422-430.

THE CARNEGIE CLASSIFICATION OF INSTITUTIONS OF HIGHER EDUCATION (n.d.) About Carnegie Classification. Retrieved (Dec 15, 2016) from http://carnegieclassifications.iu.edu/

VOSNIADOU S. (2008) Conceptual Change Research: An Introduction. In Vosniadou S. (Ed.), International Handbook of Research on Conceptual Change. 1st ed. New York/Abingdon: Routledge, pp. xiii-xxviii.

YANG Y., BUCKENDAHL C. W., JUSZKIEWICZ P. J., BHOLA D. S. (2002) A review of strategies for validating computer automated scoring. Applied Measurement in Education, 15(4):391-412.