A Comparison of Educational Statistics and Data Mining Approaches to Identify Characteristics that Impact Online Learning



Published Oct 23, 2015
L. Dee Miller, Leen-Kiat Soh, Ashok Samal, Kevin Kupzyk, Gwen Nugent


Learning objects (LOs) are important online resources for both learners and instructors, and usage of LOs is growing. Automatic LO tracking collects large amounts of metadata about individual students as well as data aggregated across courses, learning objects, and demographic characteristics (e.g., gender). The challenge becomes identifying which of the many variables derived from tracked data are useful for predicting student learning. This challenge has prompted considerable research in the fields of educational data mining and learning analytics. This work advances such research in four ways. First, we bring together two approaches for finding salient variables from separate research areas: hierarchical linear modeling (HLM) from education and Lasso feature selection from computer science. Second, we show that these two approaches have complementary and synergistic results, with some variables considered salient by both and others salient by only one. Third, and most importantly, we demonstrate the benefits of a combined approach that considers a variable salient when either HLM or Lasso considers that variable salient. This combined approach both improves model predictive accuracy and finds additional variables considered salient in previous datasets on student learning. Lastly, we use the results to provide insights into the variables that are salient to learning outcomes in undergraduate CS education. Overall, this work suggests a combined approach that improves the identification of salient variables in big data and also improves the design of LO tracking systems for learning management systems.
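
The combined rule described above (a variable is salient when either HLM or Lasso considers it salient) amounts to a set union. The following is a minimal sketch in Python, not the authors' implementation. It assumes, purely for illustration, that HLM salience is decided by a p-value threshold and that Lasso salience means a nonzero fitted coefficient; the function names, the variable names, and all numbers are hypothetical.

```python
from typing import Dict, Set


def salient_by_hlm(p_values: Dict[str, float], alpha: float = 0.05) -> Set[str]:
    """Variables whose HLM fixed-effect p-value is below alpha (assumed criterion)."""
    return {name for name, p in p_values.items() if p < alpha}


def salient_by_lasso(coefficients: Dict[str, float], tol: float = 1e-8) -> Set[str]:
    """Variables whose Lasso coefficient was not shrunk to zero."""
    return {name for name, c in coefficients.items() if abs(c) > tol}


def combined_salient(
    p_values: Dict[str, float],
    coefficients: Dict[str, float],
    alpha: float = 0.05,
    tol: float = 1e-8,
) -> Set[str]:
    """Union rule: salient if either approach flags the variable."""
    return salient_by_hlm(p_values, alpha) | salient_by_lasso(coefficients, tol)


# Made-up inputs for illustration only; these are not results from the paper.
hlm_p = {"time_on_lo": 0.01, "gender": 0.40, "self_efficacy": 0.03, "clicks": 0.20}
lasso_coef = {"time_on_lo": 0.52, "gender": 0.0, "self_efficacy": 0.0, "clicks": -0.11}

print(sorted(combined_salient(hlm_p, lasso_coef)))
# prints ['clicks', 'self_efficacy', 'time_on_lo']
```

In this toy input, "time_on_lo" is flagged by both approaches, "self_efficacy" only by the assumed HLM criterion, and "clicks" only by Lasso, which mirrors the complementary behavior the abstract reports.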

How to Cite

Miller, L. D., Soh, L.-K., Samal, A., Kupzyk, K., & Nugent, G. (2015). A Comparison of Educational Statistics and Data Mining Approaches to Identify Characteristics that Impact Online Learning. Journal of Educational Data Mining, 7(3), 117–150. https://doi.org/10.5281/zenodo.3554731



learning object tracking, predicting student learning, hierarchical linear modeling (HLM), lasso feature selection
