An Approach to Improve k-Anonymization Practices in Educational Data Mining

Published Jun 27, 2024
Frank Stinar, Zihan Xiong, Nigel Bosch

Abstract

Educational data mining has allowed for large improvements in educational outcomes and understanding
of educational processes. However, there remains a constant tension between educational data mining advances
and protecting student privacy while using educational datasets. Publicly available datasets have
facilitated numerous research projects while striving to preserve student privacy via strict anonymization
protocols (e.g., k-anonymity); however, little is known about the relationship between anonymization
and utility of educational datasets for downstream educational data mining tasks, nor how anonymization
processes might be improved for such tasks. We provide a framework for strictly anonymizing educational
datasets with a focus on improving downstream performance in common tasks such as student
outcome prediction. We evaluate our anonymization framework on five diverse educational datasets with
machine learning-based downstream task examples to demonstrate both the effect of anonymization and
our means to improve it. Our method improves downstream machine learning accuracy versus baseline
data anonymization by 30.59%, on average, by guiding the anonymization process toward strategies that
anonymize the least important information while leaving the most valuable information intact.
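
The idea of anonymizing the least important information first can be illustrated with a minimal sketch. This is not the authors' framework: it assumes whole-column suppression as a stand-in for the generalization strategies a real anonymization tool would use, and the column names, toy records, and importance scores are all hypothetical values chosen for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_ids, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k rows."""
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in counts.values())


def suppress_least_important_first(rows, quasi_ids, importance, k):
    """Suppress quasi-identifier columns in order of ascending importance
    until the data is k-anonymous. Returns (new_rows, suppressed_columns)."""
    rows = [dict(row) for row in rows]  # copy so the input stays untouched
    suppressed = []
    for col in sorted(quasi_ids, key=lambda q: importance[q]):
        if is_k_anonymous(rows, quasi_ids, k):
            break
        for row in rows:
            row[col] = "*"  # suppression: replace every value with a wildcard
        suppressed.append(col)
    return rows, suppressed


# Hypothetical toy records; "passed" is the outcome a downstream model predicts.
rows = [
    {"age": 17, "school": "A", "absences": 2, "passed": 1},
    {"age": 17, "school": "A", "absences": 5, "passed": 0},
    {"age": 18, "school": "B", "absences": 2, "passed": 1},
    {"age": 18, "school": "B", "absences": 5, "passed": 0},
]
quasi_ids = ["age", "school", "absences"]

# Hypothetical importance scores, e.g. taken from a model fit on the raw data.
importance = {"age": 0.05, "school": 0.10, "absences": 0.60}

anonymized, suppressed = suppress_least_important_first(
    rows, quasi_ids, importance, k=2
)
print(suppressed)
print(anonymized)
```

In this toy case the two low-importance columns are suppressed and the column most useful for prediction is kept, while every remaining quasi-identifier combination is shared by at least two records.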

How to Cite

Stinar, F., Xiong, Z., & Bosch, N. (2024). An Approach to Improve k-Anonymization Practices in Educational Data Mining. Journal of Educational Data Mining, 16(1), 61–83. https://doi.org/10.5281/zenodo.11056083

Keywords

student privacy, data sharing, machine learning

Section
EDM 2024 Journal Track