Clustering Educational Digital Library Usage Data: A Comparison of Latent Class Analysis and K-Means Algorithms



Published Jul 22, 2013
Beijie Xu, Mimi Recker, Xiaojun Qi, Nicholas Flann, Lei Ye


This article examines clustering as an educational data mining method. In particular, two clustering algorithms, the widely used K-means and the model-based Latent Class Analysis (LCA), are compared using usage data from an educational digital library service, the Instructional Architect. Using a multi-faceted approach and multiple data sources, three types of comparisons of the resulting clusters are presented: 1) Davies-Bouldin indices, 2) clustering results validated with user profile data, and 3) cluster evolution. LCA is superior to K-means on all three comparisons. In particular, LCA is less sensitive to the variance of feature variables, and it produces good clustering results with minimal data transformation. Our research results also show that LCA performs better than K-means in providing the most useful educational interpretation for this dataset.
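The first comparison relies on the Davies-Bouldin index, which scores a clustering by the ratio of within-cluster scatter to between-centroid distance; lower values indicate more compact, better separated clusters. The following is a minimal sketch of that index on toy data, not the article's implementation. The function names, the Euclidean distance, and the sample points are all assumptions made for illustration.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distance (lower is better)."""
    # Group points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    cents = {k: centroid(v) for k, v in clusters.items()}
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {
        k: sum(math.dist(p, cents[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }
    # For each cluster, take the worst (largest) similarity ratio to any other.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Hypothetical toy data: two well separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
tight = [0, 0, 0, 1, 1, 1]   # labels that match the two groups
mixed = [0, 1, 0, 1, 0, 1]   # labels that split both groups

db_tight = davies_bouldin(points, tight)
db_mixed = davies_bouldin(points, mixed)
print(db_tight < db_mixed)
```

In a comparison like the one described here, the same index would be computed on the cluster assignments produced by each algorithm, and the assignment with the lower value would be judged the more compact and better separated one.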

How to Cite

Xu, B., Recker, M., Qi, X., Flann, N., & Ye, L. (2013). Clustering Educational Digital Library Usage Data: A Comparison of Latent Class Analysis and K-Means Algorithms. Journal of Educational Data Mining, 5(2), 38–68.



educational data mining, educational web mining, clustering, latent class analysis, k-means, digital libraries, teacher users

ANTONENKO, P., SERKAN TOY, S., AND NIEDERHAUSER, D. 2012. Using cluster analysis for data mining in educational technology research. Educational Technology Research and Development 60, 3, 383-398.

BAKER, R. S. J. D. AND YACEF, K. 2009. The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining 1, 3-17.

BAUM, L. E. AND PETRIE, T. 1966. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics 37, 1554-1563.

BEZDEK, J. C. 1981. Pattern recognition with fuzzy objective function algorithms, New York, Plenum Press.

CAMPBELL, S. B. AND MORGAN-LOPEZ, A. A. 2009. A Latent Class Analysis of maternal depressive symptoms over 12 Years and offspring adjustment in adolescence. Journal of Abnormal Psychology 118, 479-493.

CHATTERJEE, S. AND HADI, A. S. 2006. Regression analysis by example, 4th ed. John Wiley and Sons, Inc.

CHEN, H. AND DOTY, P. 2005. A conceptual framework for digital libraries for K-12 mathematics education: Part 1, information organization, information literacy, and integrated learning. The Library Quarterly 75, 231-261.

COLLINS, L.M., AND LANZA, S.T. 2010. Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. New York: Wiley.

DAVIES, D. L. AND BOULDIN, D. W. 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 224-227.

DOGAN, B. AND CAMURCU, A. Y. 2008. Visual Clustering of multidimensional Educational data from an intelligent tutoring system. Computer Applications in Engineering Education 18, 375-382.

DUNN, J. C. 1973. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics 3, 32-57.

DURFEE, A., SCHNEBERGER, S., AND AMOROSO, D. L. 2007. Evaluating students computer-based learning using a visual data mining approach. Journal of Informatics Education Research 9, 1-28.

FRIEDMAN, M. 1940. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics 11, 86-92.

GOODMAN, L. A. 1974. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61, 2, 215-231.

GORSUCH, R. L. 1983. Factor Analysis. Lawrence Erlbaum, Hillsdale, NJ.

HAIR, J., BLACK, B., BABIN, B., ANDERSON, R. E., AND TATHAM, R. L. 2006. Multivariate Data Analysis, 6th edition, Pearson Prentice Hall, New Jersey.

HALL, D. J. AND BALL, G. B. 1965. ISODATA: A novel method of data analysis and pattern classification. Technical report, Stanford Research Institute, Menlo Park, CA.

HARP, S. A., SAMAD, T., AND VILLANO, M. 1995. Modeling student knowledge with self-organizing feature maps. IEEE Transactions on Systems, Man and Cybernetics 25, 727-737.

HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. 2009. The Elements of Statistical Learning, 2nd edition (pp. 520-529). Springer, New York.

HOSMER, D. W. AND LEMESHOW, S. 2000. Applied Logistic Regression, 2nd ed. John Wiley and Sons, Inc.

HOWELL, D. C. 2002. Statistical methods for psychology, 5th ed. Duxbury Thomson Learning, Inc., Pacific Grove, CA.

HÜBSCHER, R., PUNTAMBEKAR, S., AND NYE, A. H. 2007. Domain specific interactive data mining. In Proceedings of Workshop on Data Mining for User Modeling at the 11th International Conference on User Modeling.

KLONSKY, E. D. AND OLINO, T. M. 2008. Identifying clinically distinct subgroups of self-injurers among young adults: A Latent Class Analysis. Journal of Consulting and Clinical Psychology 76, 22-27.

KWAK, C. AND CLAYTON-MATTHEWS, A. 2002. Multinomial logistic regression. Nursing Research 51, 404-410.

LAZARSFELD, P. F. AND HENRY, N. W. 1968. Latent Structure Analysis. Houghton Mifflin.

LEE, C. 2007. Diagnostic, predictive and compositional modeling with data mining in integrated learning environments. Computers & Education 49, 562-580.

MAULL, K. E., SALDIVAR, M. G., AND SUMNER, T. 2010. Online curriculum planning behavior of teachers. In Proceedings of the 3rd International Conference on Educational Data Mining.

MAGIDSON, J. AND VERMUNT, J. 2004. Latent class models. In The SAGE Handbook of Quantitative Methodology for the Social Sciences, D. Kaplan, Ed. Sage Publications, Thousand Oaks, CA, 175-198.

MCCUTCHEON, A. L. 1987. Latent class analysis. Quantitative Applications in the Social Sciences Series 64. Sage Publication, Thousand Oaks, California.

MINKA, T. P. 2002. Beyond Newton’s method. us/um/people/minka/papers/minka-newton.pdf.

NYLUND, K.L., ASPAROUHOV, T., & MUTHÉN, B. 2007. Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling 14, 535-569.

NYLUND, K., BELLMORE, A., NISHINA, A., AND GRAHAM, S. 2007. Subtypes, severity, and structural stability of peer victimization: What does latent class analysis say? Child Development 78, 1706-1722.

PENCE, B. W., MILLER, W. C., AND GAYNES, B. N. 2009. Prevalence estimation and validation of new instruments in psychiatric research: An application of latent class analysis and sensitivity analysis. Psychological Assessment 21, 235-239.

PERERA, D., KAY, J., KOPRINSKA, I., YACEF, K., AND ZAÏANE, O. R. 2009. Clustering and sequential pattern mining of online collaborative learning data. IEEE Transactions on Knowledge and Data Engineering 21, 759-772.

PERRAULT, A. M. 2007. An Exploratory Study of Biology Teachers' Online Information Seeking Practices. School Library Media Research, 10.

RECKER, M., DORWARD, J., DAWSON, D., MAO, X., YE, L., PALMER, B, HALIORIS, S., AND PARK, J. 2006. The Annual Meeting of the American Education Research Association.

RECKER, M., WALKER, A., GIERSCH, S., MAO, X., PALMER, B., JOHNSON, D., LEARY, H., AND ROBERTSHAW, B. 2007. A study of teachers' use of online learning resources to design classroom activities. New Review of Hypermedia and Multimedia 13, 117-134.

ROMERO, C. AND VENTURA, S. 2007. Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications 33, 135-146.

ROUSSOS, L. A., TEMPLIN, J. L., AND HENSON, R. A. 2007. Skills diagnosis using IRT-based latent class models. Journal of Educational Measurement 44, 293–311.

SHIH, B., KOEDINGER, K. R., AND SCHEINES, R. 2010. Unsupervised discovery of student learning tactics. In Proceedings of the 3rd International Conference on Educational Data Mining.

SMITH, L. I. 2002. A tutorial on principal component analysis. Cornell University

VERMUNT, J. K. AND MAGIDSON, J. 2002. Latent class cluster analysis. In Applied Latent Class Analysis, J. Hagenaars and A. McCutcheon, Eds. Cambridge University Press, 89-106.

VERMUNT, J.K. AND MAGIDSON, J. 2005. Technical guide for Latent GOLD 4.0: Basic and advanced. Statistical Innovations Inc.

VESANTO, J., AND ALHONIEMI, E. 2000. Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11, 586–600.

WALKER, A., RECKER, M., ROBERTSHAW, B., OLSEN, J., LEARY, H., YE, L., & SELLERS, H., 2011. Integrating technology and problem-based learning: A mixed methods study of two teacher professional development approaches. Interdisciplinary Journal of Problem-based Learning 5, 70-94.

WANG, W., WENG, J., SU, J., AND TSENG, S. 2004. Learning portfolio analysis and mining in SCORM compliant environment. The 34th ASEE/IEEE Frontiers in Education Conference.

XU, B. 2011. Clustering educational digital library usage data: Comparisons of latent class analysis and K-Means algorithms. All Graduate Theses and Dissertations. Paper 954.

XU, B., AND RECKER, M. 2012. Teaching analytics: A clustering and triangulation study of digital library user data. Journal of Educational Technology & Society 15, 3, 103-115.

YPMA, T. J. 1995. Historical development of the Newton-Raphson Method. SIAM Review 37, 5, 531-551.

ZIMMERMAN, D. W. 1994. A note on the influence of outliers on parametric and nonparametric tests. Journal of General Psychology 121, 4, 391-401.