An Approach to Improve k-Anonymization Practices in Educational Data Mining
Abstract
Educational data mining has allowed for large improvements in educational outcomes and understanding
of educational processes. However, there remains a constant tension between advancing educational data mining
and protecting student privacy when using educational datasets. Publicly available datasets have
facilitated numerous research projects while striving to preserve student privacy via strict anonymization
protocols (e.g., k-anonymity); however, little is known about the relationship between anonymization
and utility of educational datasets for downstream educational data mining tasks, nor how anonymization
processes might be improved for such tasks. We provide a framework for strictly anonymizing educational
datasets with a focus on improving downstream performance in common tasks such as student
outcome prediction. We evaluate our anonymization framework on five diverse educational datasets with
machine learning-based downstream task examples to demonstrate both the effect of anonymization and
our means to improve it. Our method improves downstream machine learning accuracy versus baseline
data anonymization by 30.59%, on average, by guiding the anonymization process toward strategies that
anonymize the least important information while leaving the most valuable information intact.
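As a rough illustration of the idea summarized above, the following minimal sketch checks k-anonymity over a set of quasi-identifiers and then suppresses the least important quasi-identifiers first until the check passes. It is not the framework evaluated in the article: the function names, the toy records, and the importance scores are all hypothetical, and full-column suppression stands in for whatever generalization strategies an actual anonymization tool would apply.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_ids, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())


def suppress_least_important(rows, importance, k):
    """Greedily suppress quasi-identifiers, least important first,
    until the data is k-anonymous.

    `importance` maps each quasi-identifier to a hypothetical utility
    score (for example, a feature importance from a downstream model).
    Returns (anonymized copy of rows, list of suppressed columns).
    If fewer than k rows exist, every column ends up suppressed and the
    result is still not k-anonymous; callers should re-check.
    """
    quasi_ids = list(importance)
    rows = [dict(r) for r in rows]  # copy so the input is not mutated
    suppressed = []
    for col in sorted(quasi_ids, key=importance.get):
        if is_k_anonymous(rows, quasi_ids, k):
            break
        for r in rows:
            r[col] = "*"  # full suppression of this column
        suppressed.append(col)
    return rows, suppressed


# Toy records (invented for illustration only).
students = [
    {"age": 17, "school": "A", "absences": "low", "passed": 1},
    {"age": 18, "school": "A", "absences": "low", "passed": 1},
    {"age": 17, "school": "B", "absences": "high", "passed": 0},
    {"age": 18, "school": "B", "absences": "high", "passed": 0},
]

# Assumed importance scores; in practice these would be estimated.
importance = {"age": 0.05, "school": 0.20, "absences": 0.75}

anonymized, suppressed = suppress_least_important(students, importance, k=2)
print(suppressed)
print(anonymized[0])
```

In this toy case only the `age` column needs to be suppressed to reach 2-anonymity, so the two columns assumed to carry the most predictive value are left intact.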
Keywords: student privacy, data sharing, machine learning
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.