Towards Interpretable Automated Machine Learning for STEM Career Prediction
Abstract
In this paper, we describe our solution for predicting students' STEM career choices in the 2017 ASSISTments Datamining Competition. We built a machine learning system that automatically reformats the data set, generates new features, prunes redundant ones, and performs model and feature selection. We designed the system to automatically find a model that optimizes prediction performance, yet the final model is a simple logistic regression that allows researchers to discover important features and study their effects on STEM career choices. A comparison with other methods revealed that the key to good prediction is proper feature enrichment at the beginning of the data analysis, while feature selection at a later stage allows a simpler final model.
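The interplay the abstract describes, enriching features first and then selecting among them so that the final model stays a simple logistic regression, can be illustrated with a toy sketch. This is not the authors' system. It assumes an L1 (lasso) penalty fitted by proximal gradient descent, pairwise products as the only enrichment step, and a small invented data set in which the label depends only on the interaction of two inputs. All function and variable names are hypothetical.

```python
import math


def enrich(rows):
    """Append all pairwise products of the original columns to each row."""
    out = []
    for r in rows:
        products = [r[i] * r[j] for i in range(len(r)) for j in range(i + 1, len(r))]
        out.append(list(r) + products)
    return out


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, t):
    """Proximal operator of the L1 penalty: shrink toward zero by t."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, n_iter=500):
    """Fit an L1-penalized logistic regression by proximal gradient descent.

    The intercept is not penalized. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    intercept = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = intercept + sum(wj * xj for wj, xj in zip(w, xi))
            err = sigmoid(z) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        intercept -= lr * grad_b / n
        # Gradient step on the smooth loss, then shrinkage from the penalty.
        w = [soft_threshold(wj - lr * gj / n, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, intercept


# Invented toy data: the label depends only on the sign of x0 * x1,
# so neither raw input is informative on its own.
values = (-1.0, -0.5, 0.5, 1.0)
raw = [[a, c] for a in values for c in values]
y = [1 if a * c > 0 else 0 for a, c in raw]

X = enrich(raw)  # columns: x0, x1, x0*x1
w, intercept = fit_l1_logistic(X, y)

names = ["x0", "x1", "x0*x1"]
selected = [name for name, wj in zip(names, w) if wj != 0.0]
print("coefficients:", dict(zip(names, w)))
print("selected features:", selected)
```

In this toy setting the raw inputs alone carry no signal, so the enrichment step is what makes prediction possible, while the penalty discards the uninformative columns and leaves a model whose nonzero coefficients can be read directly.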
Keywords: STEM careers, automated prediction, penalized logistic regression, forward-backward search algorithm, interpretable machine learning
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.