Latent Skill Mining and Labeling from Courseware Content



Published Oct 1, 2022
Noboru Matsuda, Jesse Wood, Raj Shrivastava, Machi Shimmei, Norman Bier


A model that maps the requisite skills, or knowledge components, to the contents of an online course is necessary to implement many adaptive learning technologies. However, developing a skill model and tagging courseware contents with individual skills can be expensive and error-prone. We propose Smart (Skill Model mining with Automated detection of Resemblance among Texts), a technology that automatically identifies latent skills from the instructional text of existing online courseware. Smart mines, labels, and maps skills without using an existing skill model or student learning (i.e., response) data. The approach mines latent skills from the assessment items included in existing courseware, gives the discovered skills human-friendly labels, and maps didactic paragraph texts to those skills. Because assessment items and paragraph texts are then tagged with the same skills, a mapping between assessment items and paragraph texts is formed as well. In doing so, skill models generated automatically by Smart will reduce the workload of courseware developers while enabling adaptive online content at the launch of a course. In our evaluation study, we applied Smart to two existing, authentic online courses from OLI (Open Learning Initiative). We compared machine-generated and human-crafted skill models in terms of how accurately they predict students’ learning, and we also evaluated the similarity between the two kinds of models. The results show that, on both courses, student models based on Smart-generated skill models were as predictive of students’ learning as those based on human-crafted skill models. Smart can also generate skill models that are highly similar to human-crafted models, as evidenced by normalized mutual information (NMI) values.
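As a rough illustration of the similarity measure named in the abstract, the sketch below computes NMI between two skill models, each treated as an assignment of one skill label to every assessment item. The item labels are hypothetical, and normalizing by the arithmetic mean of the two entropies is an assumption: the abstract does not say which NMI variant the authors use.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    joint = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the entropies (assumed variant)."""
    if len(a) != len(b):
        raise ValueError("both skill models must label the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        # Both models put every item under a single skill: treat as identical.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


# Hypothetical skill assignments for six assessment items.
machine_model = ["m1", "m1", "m2", "m2", "m3", "m3"]
human_model = ["h_a", "h_a", "h_b", "h_b", "h_c", "h_c"]
score = nmi(machine_model, human_model)
```

Because NMI depends only on how items are grouped and not on what the groups are called, two models that partition the items identically score close to 1 even when their skill labels differ, while statistically independent groupings score 0.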

How to Cite

Matsuda, N., Wood, J., Shrivastava, R., Shimmei, M., & Bier, N. (2022). Latent Skill Mining and Labeling from Courseware Content. Journal of Educational Data Mining, 14(2).



skill model discovery, learning engineering, massive open online course, text mining, natural language processing

EDM 2022 Journal Track