Investigating Demographic Features and their Connection to Performance, Predictions, and Fairness in EDM Models
Abstract
Although the use of demographic features in predictive models in Educational Data Mining (EDM) is highly problematic from a fairness perspective and is currently the subject of critical discussion in the field, such features are, in practice, frequently used without much deliberation. Their use, and the discussion around it, mostly rest on the belief that they help achieve high model performance. In this paper, we theoretically and empirically assess the mechanisms that make demographic features relevant for prediction and what this means for notions of fairness. Using four datasets for at-risk prediction, we find evidence that removing demographic features does not usually decrease performance, but also that aiming for the most accurate predictions may sometimes be the wrong goal. Furthermore, we show that models nonetheless place weight on these features when they are included -- highlighting the need to exclude them. Additionally, we show that even when demographic features are excluded, some fairness concerns relating to group fairness metrics may persist. These findings strongly highlight the need to understand the causal mechanisms underlying the data and to think critically about demographic features in each specific setting -- emphasizing the need for more research on how demographic features influence educational attainment. Our code is available at: https://github.com/atschalz/edm.
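The ablation the abstract describes can be illustrated with a minimal sketch: train an at-risk classifier once with and once without a demographic feature, then compare accuracy and a group fairness metric (here, the demographic parity difference, i.e., the gap in positive-prediction rates between groups). This is not the paper's pipeline -- the paper uses four real datasets -- and all data, feature names, and effect sizes below are synthetic, illustrative assumptions.

```python
# Hypothetical sketch of the with/without-demographics comparison on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                     # synthetic demographic attribute
grades = rng.normal(0.0, 1.0, n) + 0.3 * group    # academic feature correlated with group
y = (grades + rng.normal(0.0, 0.5, n) > 0.3).astype(int)  # synthetic "at risk" label

feature_sets = {
    "with_demo": np.column_stack([grades, group]),  # demographic feature included
    "without_demo": grades.reshape(-1, 1),          # demographic feature removed
}

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)
results = {}
for name, X in feature_sets.items():
    clf = RandomForestClassifier(random_state=0).fit(X[idx_tr], y[idx_tr])
    pred = clf.predict(X[idx_te])
    # Demographic parity difference: gap in positive-prediction rates between groups.
    g_te = group[idx_te]
    dpd = abs(pred[g_te == 0].mean() - pred[g_te == 1].mean())
    results[name] = (accuracy_score(y[idx_te], pred), dpd)
print(results)
```

On data like this, where the label is driven mainly by the academic feature, the two variants typically reach similar accuracy while the fairness metric need not vanish when the demographic feature is dropped -- the pattern the abstract reports, since group membership remains correlated with the remaining features.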
Keywords: fairness, categorical features, at-risk prediction, demographic features, sensitive features, algorithmic bias
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term “Work” shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attribution—other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercial—other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Works—other users (including Publisher) may not alter, transform, or build upon this Work, with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
- the Work is the Author’s original work;
- the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
- the Work is not pending review or under consideration by another publisher;
- the Work has not previously been published;
- the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
- the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.