Of Needles and Haystacks: Building an Accurate Statewide Dropout Early Warning System in Wisconsin
Abstract
The state of Wisconsin has one of the highest four-year graduation rates in the nation, but deep disparities among student subgroups remain. To address this, the state has created the Wisconsin Dropout Early Warning System (DEWS), a predictive model of student dropout risk for students in grades six through nine. The Wisconsin DEWS is in use statewide and currently provides predictions on the likelihood of graduation for over 225,000 students. DEWS represents a novel statistical-learning-based approach to assessing students' risk of non-graduation and provides highly accurate predictions for students in the middle grades without expanding beyond mandated administrative data collections. Similar dropout early warning systems are in place in many jurisdictions across the country. Prior research has shown that in many cases the indicators used by such systems do a poor job of balancing the trade-off between correctly classifying likely dropouts and raising false alarms (Bowers et al., 2013). Building on this work, DEWS uses the receiver-operating characteristic (ROC) metric to identify the best possible set of statistical models for making predictions about individual students. This paper describes the DEWS approach and the software behind it, which leverages the open-source statistical language R (R Core Team, 2013). As a result, DEWS is a flexible series of software modules that can adapt to new data, new algorithms, and new outcome variables, not only to predict dropout but also to impute key predictors. The design and implementation of each of these modules is described in detail, as is the open-source R package, EWStools, that serves as the core of DEWS (Knowles, 2014).
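The trade-off between catching likely dropouts and raising false alarms that the ROC metric summarizes can be illustrated with a minimal sketch. It is written in Python rather than R purely for illustration, it is not the DEWS or EWStools implementation, and the risk scores, outcome labels, and function names (`roc_points`, `auc`) are hypothetical.

```python
from typing import List, Tuple

# Hypothetical toy data (an assumption for illustration, not DEWS output):
# a predicted dropout risk per student and the observed outcome (1 = dropped out).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false-alarm rate, sensitivity) pairs, one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one dropout and one non-dropout")
    points = [(0.0, 0.0)]  # threshold above every score: nobody is flagged
    for threshold in sorted(set(scores), reverse=True):
        # Outcomes of every student flagged as at-risk at this threshold.
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


curve = roc_points(scores, labels)
area = auc(curve)
print(f"AUC = {area:.2f}")
```

Each point on the curve corresponds to one cutoff for flagging a student as at-risk. Lowering the cutoff raises sensitivity but also the false-alarm rate, and the area under the curve summarizes that trade-off across all cutoffs in a single number.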
Keywords
Dropout Early Warning System, false-alarm, receiver-operating characteristic (ROC) metric, impute key predictors
References
ALLENSWORTH, E. 2013. The use of ninth-grade early warning indicators to improve Chicago schools. Journal of Education for Students Placed at Risk 1, 68–83.
BALFANZ, R. 2009. Putting middle grades students on the graduation path: A policy and practice brief. Tech. rep., National Middle School Association, Westerville, Ohio. http://www.amle.org/portals/0/pdf/research/research_from_the_field/policy_brief_balfanz.pdf.
BALFANZ, R. AND HERZOG, L. 2006. Keeping middle grades students on-track to graduation: Initial analysis and implications. http://web.jhu.edu/CSOS/graduation-gap/edweek/Balfanz_Herzog.ppt.
BALFANZ, R. AND IVER, D. M. 2006. Closing the mathematics achievement gap in high poverty middle schools: Enablers and constraints. Journal of Education for Students Placed At Risk 11, 2, 143–159.
BALFANZ, R. AND IVER, D. M. 2007. Preventing student disengagement and keeping students on the graduation path in the urban middle grade schools: Early identification and effective interventions. Educational Psychologist 42, 4, 223–235.
BALFANZ, R. AND LEGTERS, N. 2004. Locating the dropout crisis: Which high schools produce the nation’s dropouts? Tech. Rep. 70, Center for Research on the Education of Students Placed At Risk, Baltimore, MD. http://files.eric.ed.gov/fulltext/ED484525.pdf.
BOWERS, A. J. AND SPROTT, R. 2012a. Examining the multiple trajectories associated with dropping out of high school: A growth mixture model analysis. The Journal of Educational Research 105, 176–195.
BOWERS, A. J. AND SPROTT, R. 2012b. Why tenth graders fail to finish high school: A dropout typology latent class analysis. Journal of Education for Students Placed at Risk 17, 129–148.
BOWERS, A. J., SPROTT, R., AND TAFF, S. A. 2013. Do we know who will drop out? A review of the predictors of dropping out of high school: Precision, sensitivity, and specificity. The High School Journal 96, 77–100.
BREIMAN, L. 2001a. Random forests. Machine Learning 45, 1, 5–32.
BREIMAN, L. 2001b. Statistical modeling: The two cultures. Statistical Science 16, 199–231.
BURNHAM, K. AND ANDERSON, D. 2002. Model Selection and Multi Model Inference: A Practical Information-Theoretic Approach, Second ed. Springer, New York. ISBN 0-387-95364-7.
CARL, B., RICHARDSON, J. T., CHENG, E., KIM, H., AND MEYER, R. H. 2013. Theory and application of early warning systems for high school and beyond. Journal of Education for Students Placed at Risk 1, 29–49.
CHAPELLE, O., VAPNIK, V., BOUSQUET, O., AND MUKHERJEE, S. 2002. Choosing multiple parameters for support vector machines. Machine Learning 46, 1-3, 131–159.
CHATFIELD, C. 1995. Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society 158, 419–466.
DAHL, D. B. 2013. xtable: Export tables to LaTeX or HTML. R package version 1.7-1.
DASU, T. AND JOHNSON, T. 2003. Exploratory Data Mining and Data Cleaning. Wiley-Interscience, New York.
DAVIS, M., HERZOG, L., AND LEGTERS, N. 2013. Organizing schools to address early warning indicators (EWIs): Common practices and challenges. Journal of Education for Students Placed at Risk 1, 84–100.
DOWLE, M., SHORT, T., AND LIANOGLOU, S. 2013. data.table: Extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping and list columns. R package version 1.8.8.
EASTON, J. AND ALLENSWORTH, E. 2005. The on-track indicator as a predictor of high school graduation. Tech. rep., Consortium on Chicago School Research, Chicago. http://ccsr.uchicago.edu/publications/track-indicator-predictor-high-school-graduation.
EASTON, J. AND ALLENSWORTH, E. 2007. What matters for staying on-track and graduating in Chicago public high schools: A close look at course grades, failures, and attendance in the freshman year. Tech. rep., Consortium on Chicago School Research, Chicago. http://ccsr.uchicago.edu/publications/what-matters-staying-track-and-graduating-chicago-public-schools.
EFROYMSON, M. 1960. Multiple regression analysis. In Mathematical Methods for Digital Computers, A. Ralston and H. Wilf, Eds. Wiley, New York.
EVERS, A. 2012. Agenda 2017: Every child a graduate college and career ready. http://dpi.wi.gov/sprntdnt/pdf/agenda2017.pdf.
FRIEDMAN, J., HASTIE, T., AND TIBSHIRANI, R. 2000. Additive logistic regression: A statistical view of boosting. Annals of Statistics 28, 2, 337–374.
GELMAN, A., CARLIN, J. B., STERN, H. S., DUNSON, D. B., VEHTARI, A., AND RUBIN, D. B. 2013. Bayesian Data Analysis, Third ed. Chapman & Hall / CRC Texts in Statistical Science, London.
GELMAN, A. AND HILL, J. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge.
GLEASON, P. AND DYNARSKI, M. 2002. Do we know whom to serve? Issues in using risk factors to identify dropouts. Journal of Education for Students Placed At Risk 7, 25–41. http://www.mathematica-mpr.com/publications/PDFs/dod-risk.pdf.
GRÖMPING, U. 2006. Relative importance for linear regression in R: The package relaimpo. The Journal of Statistical Software 17.
HANCZAR, B., HUA, J., SIMA, C., WEINSTEIN, J., BITTNER, M., AND DOUGHERTY, E. 2010. Small-sample precision of ROC-related estimates. Bioinformatics 26, 822–830.
HAND, D. J. 2009. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning 77, 103–123.
HANLEY, J. AND MCNEIL, B. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36.
HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2 ed. Springer New York, New York. http://books.google.com/books/about/The_Elements_of_Statistical_Learning.html?id=tVIjmNS3Ob8C.
HEPPEN, J. B. AND THERRIAULT, S. B. 2008. Developing early warning systems to identify potential high school dropouts. Tech. rep., National High School Center, Washington D.C. http://www.betterhighschools.org/pubs/documents/IssueBrief_EarlyWarningSystemsGuide.pdf.
HLAVAC, M. 2013. stargazer: LaTeX code for well-formatted regression and summary statistics tables. Harvard University, Cambridge, USA. R package version 3.0.1.
HONAKER, J., KING, G., AND BLACKWELL, M. 2011. Amelia II: A program for missing data. Journal of Statistical Software 45, 7, 1–47.
INMON, W. 2005. Building the Data Warehouse, Fourth ed. John Wiley and Sons, New York. ISBN 978-1265-0645-3.
JAMES, G., WITTEN, D., HASTIE, T., AND TIBSHIRANI, R. 2013. An Introduction to Statistical Learning, 1 ed. Springer New York, New York.
JANOSZ, M., ARCHAMBAULT, I., MORIZOT, J., AND PAGANI, L. S. 2008. School engagement trajectories and their differential predictive relations to dropout. Journal of Social Issues 64, 1, 21–40.
JERALD, C. D. 2006. Identifying potential dropouts: Key lessons for building an early warning data system. Tech. rep., Achieve, Inc., Washington D.C. http://www.jff.org/sites/default/files/IdentifyingPotentialDropouts.pdf.
KEMPLE, J. J., SEGERITZ, M. D., AND STEPHENSON, N. 2013. Building on-track indicators for high school graduation and college readiness: Evidence from New York City. Journal of Education for Students Placed at Risk 1, 7–28.
KENNELLY, L. AND MONRAD, M. 2007. Approaches to dropout prevention: Heeding early warning signs with appropriate interventions. Tech. rep., National High School Center, Washington D.C. http://www.betterhighschools.org/docs/nhsc_approachestodropoutprevention.pdf.
KIMBALL, R. AND ROSS, M. 2002. The Data Warehouse Toolkit, Second ed. John Wiley and Sons, New York. ISBN 0-471-20024-7.
KNOWLES, J. AND WHITE, D. 2013. The Wisconsin dropout early warning system action guide. Tech. rep., Wisconsin Department of Public Instruction, Madison, WI. http://wise.dpi.wi.gov/files/wise/pdf/wi-dews-actionguide.pdf.
KNOWLES, J. E. 2014. EWStools: Tools for automating the testing and evaluation of education early warning system models. R package version 0.1.
KUHN, M. AND JOHNSON, K. 2013. Applied Predictive Modeling, First ed. Springer, New York. ISBN 978-1-4614-6848-6.
KUHN, M., WESTON, S., AND CODE FOR C5.0 BY R. QUINLAN, N. C. C. 2013. C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.0-15.
KUHN, M., WING, J., WESTON, S., WILLIAMS, A., KEEFER, C., ENGELHARDT, A., AND COOPER, T. 2013. caret: Classification and Regression Training. R package version 5.15-61.
KUNCHEVA, L. AND WHITAKER, C. 2003. Measures of diversity in classifier ensembles. Machine Learning 51, 181–207.
LOBO, J. M., JIMÉNEZ-VALVERDE, A., AND REAL, R. 2008. AUC: A misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17, 145–151.
MAYER, Z. AND KNOWLES, J. 2014. caretEnsemble: Framework for combining caret models into ensembles. R package version 1.0.
MUTHÉN, B. 2004. Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In The SAGE Handbook of Quantitative Methodology for the Social Sciences, D. Kaplan, Ed. Sage, Thousand Oaks, CA, 345–370.
NEILD, R. C., STONER-EBY, S., AND FURSTENBERG, F. 2008. Connecting entrance and departure: the transition to ninth grade and high school dropout. Education and Urban Society 50, 543–569.
PEDREGOSA, F., VAROQUAUX, G., GRAMFORT, A., MICHEL, V., THIRION, B., GRISEL, O., BLONDEL, M., PRETTENHOFER, P., WEISS, R., DUBOURG, V., VANDERPLAS, J., PASSOS, A., COURNAPEAU, D., BRUCHER, M., PERROT, M., AND DUCHESNAY, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
R CORE TEAM. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
RIPLEY, B. AND LAPSLEY, M. 2012. RODBC: ODBC Database Access. R package version 1.3-6.
ROBIN, X., TURCK, N., HAINARD, A., TIBERTI, N., LISACEK, F., SANCHEZ, J.-C., AND MÜLLER, M. 2011. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77.
RODERICK, M. 1993. The path to dropping out: Evidence for Intervention. Auburn House, Westport, CT.
RODERICK, M. AND CAMBURN, E. 1999. Risk and recovery from course failure in the early years of high school. American Educational Research Journal 36, 303–344. http://www.eric.ed.gov/ERICWebPortal/search/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=EJ600524&ERICExtSearch_SearchType_0=no&accno=EJ600524.
RUMBERGER, R. W. 1995. Dropping out of middle school: A multilevel analysis of students and schools. American Educational Research Journal 32, 583–625. http://www.education.ucsb.edu/rumberger/internet%20pages/Papers/Rumberger--Droputs%20from%20middle%20school%20(AERJ%201995).pdf.
SCULLEY, D., HOLT, G., GOLOVIN, D., DAVYDOV, E., PHILLIPS, T., EBNER, D., CHAUDHARY, V., AND YOUNG, M. 2014. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).
SINCLAIR, M., CHRISTENSON, S., AND THURLOW, M. 2005. Promoting school completion of urban secondary youth with emotional or behavioral disabilities. Exceptional Children 71, 465–482. http://www.iod.unh.edu/APEX%20Trainings/Tier%202%20Manual/Additional%20Reading/3.%20Check%20and%20Connect.pdf.
SOLLICH, P. AND KROGH, A. 1996. Learning with ensembles: how overfitting can be useful. Advances in Neural Information Processing Systems 8, 190–196.
SWETS, J. 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.
THE STRATEGIC DATA PROJECT. 2012. The strategic data project toolkit version 1.1. http://www.gse.harvard.edu/~pfpie/index.php/sdp/tools.
US DEPARTMENT OF EDUCATION. 2012a. SLDS data use issue brief III: Turning administrative data into research-ready longitudinal datasets. Tech. rep., US Department of Education, Washington, D.C. http://nces.ed.gov/programs/slds/pdf/Data-Use-Issue-Brief-3_Research-Ready-Datasets.pdf.
US DEPARTMENT OF EDUCATION. 2012b. SLDS data use issue brief IV: Techniques for analyzing longitudinal administrative data. Tech. rep., US Department of Education, Washington, D.C. http://nces.ed.gov/programs/slds/pdf/Data-Use-Issue-Brief-4_Analysis-Techniques.pdf.
VAPNIK, V. 1998. Statistical Learning Theory. John Wiley and Sons, New York NY.
VENABLES, W. N. AND RIPLEY, B. D. 2002. Modern Applied Statistics with S, Fourth ed. Springer, New York. ISBN 0-387-95457-0.
VIVO, J. AND FRANCO, M. 2008. How does one assess the accuracy of academic success predictors? ROC analysis applied to university entrance factors. International Journal of Mathematical Education in Science and Technology 39, 325–340.
WICKHAM, H. 2009. ggplot2: elegant graphics for data analysis. Springer New York. http://had.co.nz/ggplot2/book.
WICKHAM, H. 2014. Tidy data. The Journal of Statistical Software 59.
XIE, Y. 2013. knitr: A general-purpose package for dynamic report generation in R. R package version 1.1.
YOUDEN, W. 1950. Index for rating diagnostic tests. Cancer 3, 32–35.
ZWEIG, M. AND CAMPBELL, G. 1993. Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry 39, 561–577.