This article examines clustering as an educational data mining method. In particular, two clustering algorithms, the widely used K-means and the model-based Latent Class Analysis (LCA), are compared using usage data from an educational digital library service, the Instructional Architect (IA.usu.edu). Using a multi-faceted approach and multiple data sources, three types of comparisons of the resulting clusters are presented: 1) Davies-Bouldin indices, 2) clustering results validated with user profile data, and 3) cluster evolution. LCA outperforms K-means on all three comparisons. In particular, LCA is less sensitive to the variance of the feature variables and produces good clustering results with minimal data transformation. Our results also show that, for this dataset, LCA yields a more useful educational interpretation than K-means.
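The first comparison relies on the Davies-Bouldin index (Davies and Bouldin 1979), which averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance; lower values indicate more compact, better-separated clusters. The following is a minimal sketch of that index, not the authors' implementation. It assumes Euclidean distance, at least two non-empty clusters, and distinct centroids. The function names and the toy data (`tight`, `loose`) are hypothetical and are not taken from the Instructional Architect dataset.

```python
import math


def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Assumes Euclidean distance, two or more non-empty clusters, and
    distinct centroids (otherwise the ratio divides by zero).
    """
    centers = [centroid(c) for c in clusters]
    # S_i: mean distance from each point to its own cluster centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # R_i: largest (S_i + S_j) / M_ij over every other cluster j,
        # where M_ij is the distance between the two centroids.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical toy data: same within-cluster spread, different separation.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]

print(davies_bouldin(tight))
print(davies_bouldin(loose))
```

In this toy example both partitions have the same within-cluster scatter, so the partition whose centroids lie farther apart (`tight`) receives the lower, and therefore better, index.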
Keywords: educational data mining, educational web mining, clustering, latent class analysis, k-means, digital libraries, teacher users