Of Needles and Haystacks: Building an Accurate Statewide Dropout Early Warning System in Wisconsin



Published Jul 25, 2015
Jared E Knowles


The state of Wisconsin has one of the highest four-year graduation rates in the nation, but deep disparities among student subgroups remain. To address this, the state has created the Wisconsin Dropout Early Warning System (DEWS), a predictive model of student dropout risk for students in grades six through nine. The Wisconsin DEWS is in use statewide and currently provides predictions on the likelihood of graduation for over 225,000 students. DEWS represents a novel statistical-learning-based approach to the challenge of assessing the risk of non-graduation for students and provides highly accurate predictions for students in the middle grades without expanding beyond mandated administrative data collections. Similar dropout early warning systems are in place in many jurisdictions across the country. Prior research has shown that in many cases the indicators used by such systems do a poor job of balancing the trade-off between correctly classifying likely dropouts and raising false alarms (Bowers et al., 2013). Building on this work, DEWS uses the receiver-operating characteristic (ROC) metric to identify the best possible set of statistical models for making predictions about individual students. This paper describes the DEWS approach and the software behind it, which leverages the open source statistical language R (R Core Team, 2013). As a result, DEWS is a flexible series of software modules that can adapt to new data, new algorithms, and new outcome variables, not only to predict dropout but also to impute key predictors. The design and implementation of each of these modules is described in detail, as is the open-source R package, EWStools, that serves as the core of DEWS (Knowles, 2014).
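The trade-off between catching likely dropouts and raising false alarms can be made concrete with a small sketch. The Python below is not the DEWS or EWStools code (which is written in R); the function names `roc_points` and `auc` and the toy data are hypothetical, chosen only to illustrate how an ROC curve summarizes that trade-off across every possible risk threshold.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false_alarm_rate, hit_rate) pairs, one per distinct score threshold.

    labels: 1 for a student who did not graduate, 0 otherwise (assumed coding).
    scores: predicted risk, higher meaning more likely to drop out.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank students from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]  # threshold above every score: nobody is flagged
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Flag every student tied at this score before recording a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy data: three dropouts among eight students (made up for illustration).
    toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    curve = roc_points(toy_labels, toy_scores)
    print(curve)
    print(auc(curve))
```

Each point on the curve is one candidate cut-off: moving along it flags more true dropouts at the cost of more false alarms. Comparing candidate models by the area under this curve is one common way to rank them without committing to a single threshold in advance.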

How to Cite

Knowles, J. E. (2015). Of Needles and Haystacks: Building an Accurate Statewide Dropout Early Warning System in Wisconsin. Journal of Educational Data Mining, 7(3), 18–67. https://doi.org/10.5281/zenodo.3554725



Dropout Early Warning System, false-alarm, receiver-operating characteristic (ROC) metric, impute key predictors

AGUIAR, E., LAKKARAJU, H., BHANPURI, N., MILLER, D., YUHAS, B., AND ADDISON, K. L. 2015. Who, when, and why: A machine learning approach to prioritizing students at risk of not graduating high school on time. In Proceedings of the 2015 Learning Analytics and Knowledge Conference.

ALLENSWORTH, E. 2013. The use of ninth-grade early warning indicators to improve Chicago schools. Journal of Education for Students Placed at Risk 1, 68–83.

BALFANZ, R. 2009. Putting middle grades students on the graduation path: A policy and practice brief. Tech. rep., National Middle School Association, Westerville, Ohio. http://www.amle.org/portals/0/pdf/research/research_from_the_field/policy_brief_balfanz.pdf.

BALFANZ, R. AND HERZOG, L. 2006. Keeping middle grades students on-track to graduation: Initial analysis and implications. http://web.jhu.edu/CSOS/graduation-gap/edweek/Balfanz_Herzog.ppt.

BALFANZ, R. AND IVER, D. M. 2006. Closing the mathematics achievement gap in high poverty middle schools: Enablers and constraints. Journal of Education for Students Placed At Risk 11, 2, 143–159.

BALFANZ, R. AND IVER, D. M. 2007. Preventing student disengagement and keeping students on the graduation path in the urban middle grade schools: Early identification and effective interventions. Educational Psychologist 42, 4, 223–235.

BALFANZ, R. AND LEGTERS, N. 2004. Locating the dropout crisis: Which high schools produce the nation’s dropouts? Tech. Rep. 70, Center for Research on the Education of Students Placed At Risk, Baltimore, MD. http://files.eric.ed.gov/fulltext/ED484525.pdf.

BOWERS, A. J. AND SPROTT, R. 2012a. Examining the multiple trajectories associated with dropping out of high school: A growth mixture model analysis. The Journal of Educational Research 105, 176–195.

BOWERS, A. J. AND SPROTT, R. 2012b. Why tenth graders fail to finish high school: A dropout typology latent class analysis. Journal of Education for Students Placed at Risk 17, 129–148.

BOWERS, A. J., SPROTT, R., AND TAFF, S. A. 2013. Do we know who will drop out? a review of the predictors of dropping out of high school: Precision, sensitivity, and specificity. The High School Journal 96, 77–100.

BREIMAN, L. 2001a. Random forests. Machine Learning 45, 1, 5–32.

BREIMAN, L. 2001b. Statistical modeling: The two cultures. Statistical Science 16, 199–231.

BURNHAM, K. AND ANDERSON, D. 2002. Model Selection and Multi Model Inference: A Practical Information-Theoretic Approach, Second ed. Springer, New York. ISBN 0-387-95364-7.

CARL, B., RICHARDSON, J. T., CHENG, E., KIM, H., AND MEYER, R. H. 2013. Theory and application of early warning systems for high school and beyond. Journal of Education for Students Placed at Risk 1, 29–49.

CHAPELLE, O., VAPNIK, V., BOUSQUET, O., AND MUKHERJEE, S. 2002. Choosing multiple parameters for support vector machines. Machine Learning 46, 1-3, 131–159.

CHATFIELD, C. 1995. Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society 158, 419–466.

DAHL, D. B. 2013. xtable: Export tables to LaTeX or HTML. R package version 1.7-1.

DASU, T. AND JOHNSON, T. 2003. Exploratory Data Mining and Data Cleaning. Wiley-Interscience, New York.

DAVIS, M., HERZOG, L., AND LEGTERS, N. 2013. Organizing schools to address early warning indicators (EWIs): Common practices and challenges. Journal of Education for Students Placed at Risk 1, 84–100.

DOWLE, M., SHORT, T., AND LIANOGLOU, S. 2013. data.table: Extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping and list columns. R package version 1.8.8.

EASTON, J. AND ALLENSWORTH, E. 2005. The on-track indicator as a predictor of high school graduation. Tech. rep., Consortium on Chicago School Research, Chicago. http://ccsr.uchicago.edu/publications/track-indicator-predictor-high-school-graduation.

EASTON, J. AND ALLENSWORTH, E. 2007. What matters for staying on-track and graduating in Chicago public high schools: A close look at course grades, failures, and attendance in the freshman year. Tech. rep., Consortium on Chicago School Research, Chicago. http://ccsr.uchicago.edu/publications/what-matters-staying-track-and-graduating-chicago-public-schools.

EFROYMSON, M. 1960. Multiple regression analysis. In Mathematical Methods for Digital Computers, A. Ralston and H. Wilf, Eds. Wiley, New York.

EVERS, A. 2012. Agenda 2017: Every child a graduate college and career ready. http://dpi.wi.gov/sprntdnt/pdf/agenda2017.pdf.

FRIEDMAN, J., HASTIE, T., AND TIBSHIRANI, R. 2000. Additive logistic regression: A statistical view of boosting. Annals of Statistics 28, 2, 337–374.

GELMAN, A., CARLIN, J. B., STERN, H. S., DUNSON, D. B., VEHTARI, A., AND RUBIN, D. B. 2013. Bayesian Data Analysis, Third ed. Chapman & Hall / CRC Texts in Statistical Science, London.

GELMAN, A. AND HILL, J. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge.

GLEASON, P. AND DYNARSKI, M. 2002. Do we know whom to serve? issues in using risk factors to identify dropouts. Journal of Education for Students Placed At Risk 7, 25–41. http://www.mathematica-mpr.com/publications/PDFs/dod-risk.pdf.

GRÖMPING, U. 2006. Relative importance for linear regression in R: The package relaimpo. The Journal of Statistical Software 17.

HANCZAR, B., HUA, J., SIMA, C., WEINSTEIN, J., BITTNER, M., AND DOUGHERTY, E. 2010. Small-sample precision of ROC-related estimates. Bioinformatics 26, 822–830.

HAND, D. J. 2009. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning 77, 103–123.

HANLEY, J. AND MCNEIL, B. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36.

HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2 ed. Springer New York, New York. http://books.google.com/books/about/The_Elements_of_Statistical_Learning.html?id=tVIjmNS3Ob8C.

HEPPEN, J. B. AND THERRIAULT, S. B. 2008. Developing early warning systems to identify potential high school dropouts. Tech. rep., National High School Center, Washington D.C. http://www.betterhighschools.org/pubs/documents/IssueBrief_EarlyWarningSystemsGuide.pdf.

HLAVAC, M. 2013. stargazer: LaTeX code for well-formatted regression and summary statistics tables. Harvard University, Cambridge, USA. R package version 3.0.1.

HONAKER, J., KING, G., AND BLACKWELL, M. 2011. Amelia II: A program for missing data. Journal of Statistical Software 45, 7, 1–47.

INMON, W. 2005. Building the Data Warehouse, Fourth ed. John Wiley and Sons, New York. ISBN 978-1265-0645-3.

JAMES, G., WITTEN, D., HASTIE, T., AND TIBSHIRANI, R. 2013. An Introduction to Statistical Learning, 1 ed. Springer New York, New York.

JANOSZ, M., ARCHAMBAULT, I., MORIZOT, J., AND PAGANI, L. S. 2008. School engagement trajectories and their differential predictive relations to dropout. Journal of Social Issues 64, 1, 21–40.

JERALD, C. D. 2006. Identifying potential dropouts: Key lessons for building an early warning data system. Tech. rep., Achieve, Inc., Washington D.C. http://www.jff.org/sites/default/files/IdentifyingPotentialDropouts.pdf.

KEMPLE, J. J., SEGERITZ, M. D., AND STEPHENSON, N. 2013. Building on-track indicators for high school graduation and college readiness: Evidence from New York City. Journal of Education for Students Placed at Risk 1, 7–28.

KENNELLY, L. AND MONRAD, M. 2007. Approaches to dropout prevention: Heeding early warning signs with appropriate interventions. Tech. rep., National High School Center, Washington D.C. http://www.betterhighschools.org/docs/nhsc_approachestodropoutprevention.pdf.

KIMBALL, R. AND ROSS, M. 2002. The Data Warehouse Toolkit, Second ed. John Wiley and Sons, New York. ISBN 0-471-20024-7.

KNOWLES, J. AND WHITE, D. 2013. The Wisconsin dropout early warning system action guide. Tech. rep., Wisconsin Department of Public Instruction, Madison, WI. http://wise.dpi.wi.gov/files/wise/pdf/wi-dews-actionguide.pdf.

KNOWLES, J. E. 2014. EWStools: Tools for automating the testing and evaluation of education early warning system models. R package version 0.1.

KUHN, M. AND JOHNSON, K. 2013. Applied Predictive Modeling, First ed. Springer, New York. ISBN 978-1-4614-6848-6.

KUHN, M., WESTON, S., AND COULTER, N. 2013. C50: C5.0 Decision Trees and Rule-Based Models. C code for C5.0 by R. Quinlan. R package version 0.1.0-15.

KUHN, M., WING, J., WESTON, S., WILLIAMS, A., KEEFER, C., ENGELHARDT, A., AND COOPER, T. 2013. caret: Classification and Regression Training. R package version 5.15-61.

KUNCHEVA, L. AND WHITAKER, C. 2003. Measures of diversity in classifier ensembles. Machine Learning 51, 181–207.

LOBO, J. M., JIMÉNEZ-VALVERDE, A., AND REAL, R. 2008. AUC: A misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17, 145–151.

MAYER, Z. AND KNOWLES, J. 2014. caretEnsemble: Framework for combining caret models into ensembles. R package version 1.0.

MUTHÉN, B. 2004. Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In The SAGE Handbook of Quantitative Methodology for the Social Sciences, D. Kaplan, Ed. Sage, Thousand Oaks, CA, 345–370.

NEILD, R. C., STONER-EBY, S., AND FURSTENBERG, F. 2008. Connecting entrance and departure: the transition to ninth grade and high school dropout. Education and Urban Society 50, 543–569.

PEDREGOSA, F., VAROQUAUX, G., GRAMFORT, A., MICHEL, V., THIRION, B., GRISEL, O., BLONDEL, M., PRETTENHOFER, P., WEISS, R., DUBOURG, V., VANDERPLAS, J., PASSOS, A., COURNAPEAU, D., BRUCHER, M., PERROT, M., AND DUCHESNAY, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.

R CORE TEAM. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

RIPLEY, B. AND LAPSLEY, M. 2012. RODBC: ODBC Database Access. R package version 1.3-6.

ROBIN, X., TURCK, N., HAINARD, A., TIBERTI, N., LISACEK, F., SANCHEZ, J.-C., AND MÜLLER, M. 2011. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77.

RODERICK, M. 1993. The path to dropping out: Evidence for Intervention. Auburn House, Westport, CT.

RODERICK, M. AND CAMBURN, E. 1999. Risk and recovery from course failure in the early years of high school. American Educational Research Journal 36, 303–344. http://www.eric.ed.gov/ERICWebPortal/search/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=EJ600524&ERICExtSearch_SearchType_0=no&accno=EJ600524.

RUMBERGER, R. W. 1995. Dropping out of middle school: A multilevel analysis of students and schools. American Educational Research Journal 32, 583–625. http://www.education.ucsb.edu/rumberger/internet%20pages/Papers/Rumberger--Droputs%20from%20middle%20school%20(AERJ%201995).pdf.

SCULLEY, D., HOLT, G., GOLOVIN, D., DAVYDOV, E., PHILLIPS, T., EBNER, D., CHAUDHARY, V., AND YOUNG, M. 2014. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).

SINCLAIR, M., CHRISTENSON, S., AND THURLOW, M. 2005. Promoting school completion of urban secondary youth with emotional or behavioral disabilities. Exceptional Children 71, 465–482. http://www.iod.unh.edu/APEX%20Trainings/Tier%202%20Manual/Additional%20Reading/3.%20Check%20and%20Connect.pdf.

SOLLICH, P. AND KROGH, A. 1996. Learning with ensembles: how overfitting can be useful. Advances in Neural Information Processing Systems 8, 190–196.

SWETS, J. 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.

THE STRATEGIC DATA PROJECT. 2012. The strategic data project toolkit version 1.1. http://www.gse.harvard.edu/~pfpie/index.php/sdp/tools.

US DEPARTMENT OF EDUCATION. 2012a. SLDS data use issue brief III: Turning administrative data into research-ready longitudinal datasets. Tech. rep., US Department of Education, Washington, D.C. http://nces.ed.gov/programs/slds/pdf/Data-Use-Issue-Brief-3_Research-Ready-Datasets.pdf.

US DEPARTMENT OF EDUCATION. 2012b. SLDS data use issue brief IV: Techniques for analyzing longitudinal administrative data. Tech. rep., US Department of Education, Washington, D.C. http://nces.ed.gov/programs/slds/pdf/Data-Use-Issue-Brief-4_Analysis-Techniques.pdf.

VAPNIK, V. 1998. Statistical Learning Theory. John Wiley and Sons, New York NY.

VENABLES, W. N. AND RIPLEY, B. D. 2002. Modern Applied Statistics with S, Fourth ed. Springer, New York. ISBN 0-387-95457-0.

VIVO, J. AND FRANCO, M. 2008. How does one assess the accuracy of academic success predictors? ROC analysis applied to university entrance factors. International Journal of Mathematical Education in Science and Technology 39, 325–340.

WICKHAM, H. 2009. ggplot2: elegant graphics for data analysis. Springer New York. http://had.co.nz/ggplot2/book.

WICKHAM, H. 2014. Tidy data. The Journal of Statistical Software 59.

XIE, Y. 2013. knitr: A general-purpose package for dynamic report generation in R. R package version 1.1.

YOUDEN, W. 1950. Index for rating diagnostic tests. Cancer 3, 32–35.

ZWEIG, M. AND CAMPBELL, G. 1993. Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry 39, 561–577.