Statistical Consequences of using Multi-armed Bandits to Conduct Adaptive Educational Experiments



Published Jun 18, 2019
Anna Rafferty, Huiji Ying, Joseph Williams


Randomized experiments can provide key insights for improving educational technologies, but many students may experience conditions associated with inferior learning outcomes in these experiments. Multi-armed bandit (MAB) algorithms can address this issue by accumulating evidence from the experiment as it runs and modifying the experimental design to assign more helpful conditions to a greater proportion of future students. Using simulations, we explore the statistical impact of using MAB algorithms for experiment design, focusing on the tradeoff between acquiring statistically reliable information from the experiment and benefits to students. We consider how temporal biases in patterns of student behavior may impact the results of MAB experiments, and model data from ten previous educational experiments to demonstrate potential impacts of MAB assignment. Results suggest that MAB experiments can lead to much higher average benefits to students than traditional experimental designs, although at least twice as many participants are needed for acceptable statistical power. Using an optimistic prior distribution for the MAB algorithm mitigates the loss in power to some extent, without significantly reducing benefits to students. Additionally, longer experiments with MAB assignment still assign fewer students to a less effective condition than the typical practice of running a shorter experiment and then choosing one condition for all future students. However, MAB assignment does increase false positive rates, especially if there are temporal biases in when students enter the experiment. Caution must thus be used when interpreting results from MAB assignment in cases where students can choose when to participate in the experiment. Overall, in scenarios where student characteristics do not vary over time, MAB experimental designs can be beneficial for students and, given large sample sizes, effective for reliably determining which of two differing conditions is better.
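The adaptive assignment idea summarized above can be illustrated with a minimal sketch. It assumes binary (success/failure) student outcomes and uses Thompson sampling with Beta priors as one example of a MAB algorithm; the abstract does not state which algorithm or outcome model the paper uses, and the function names, the `prior` parameter, and the success rates below are hypothetical choices made only for illustration.

```python
import random


def run_bandit_experiment(p_arms, n_students, prior=(1.0, 1.0), seed=0):
    """Simulate one adaptive experiment with binary outcomes.

    Assumption: Thompson sampling with a Beta(prior[0], prior[1]) prior per
    condition stands in for "a MAB algorithm"; p_arms are hypothetical true
    success rates of each condition.
    """
    rng = random.Random(seed)
    successes = [0] * len(p_arms)
    failures = [0] * len(p_arms)
    for _ in range(n_students):
        # Sample a plausible success rate for each condition from its posterior.
        draws = [
            rng.betavariate(prior[0] + s, prior[1] + f)
            for s, f in zip(successes, failures)
        ]
        # Assign the next student to the condition with the highest draw.
        arm = draws.index(max(draws))
        if rng.random() < p_arms[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    counts = [s + f for s, f in zip(successes, failures)]
    return counts, successes


def run_uniform_experiment(p_arms, n_students, seed=0):
    """Simulate a traditional experiment with uniform random assignment."""
    rng = random.Random(seed)
    counts = [0] * len(p_arms)
    successes = [0] * len(p_arms)
    for _ in range(n_students):
        arm = rng.randrange(len(p_arms))
        counts[arm] += 1
        if rng.random() < p_arms[arm]:
            successes[arm] += 1
    return counts, successes


if __name__ == "__main__":
    true_rates = (0.3, 0.7)  # hypothetical conditions, second one is better
    for name, run in [("bandit", run_bandit_experiment),
                      ("uniform", run_uniform_experiment)]:
        counts, successes = run(true_rates, 2000)
        print(name, "assignments:", counts, "total successes:", sum(successes))
```

In a sketch like this, the bandit tends to send most students to the better condition, which raises the total benefit but leaves few observations for the weaker condition. That imbalance is the source of the power tradeoff the abstract describes.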

How to Cite

Rafferty, A., Ying, H., & Williams, J. (2019). Statistical Consequences of using Multi-armed Bandits to Conduct Adaptive Educational Experiments. JEDM | Journal of Educational Data Mining, 11(1), 47-79. Retrieved from



experimental design, educational experiment, simulation, statistical hypothesis testing, adaptive experimentation, multi-armed bandits
