Grouping learners effectively in an online environment is a highly useful task. However, the datasets used for this task often have large numbers of attributes of disparate types and different scales, which traditional clustering approaches cannot handle effectively. Here, a unique dissimilarity measure based on the random forest is presented that addresses these drawbacks of more traditional clustering approaches. Additionally, a rule-based method is proposed for interpreting the resulting learner segmentations. The approach was implemented on a real dataset of users of the CareerWISE online educational environment, which is designed to provide resilience training for women STEM doctoral students, and was shown to find stable and meaningful groups of users.
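The abstract does not spell out how the random-forest dissimilarity is built, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a common unsupervised random-forest recipe: create synthetic contrast data by permuting each attribute independently, train a forest to separate real from synthetic observations, and treat the fraction of trees in which two real observations share a leaf as their proximity. The function name `rf_dissimilarity`, its parameters, the use of scikit-learn, and the restriction to numeric input (categorical attributes would need encoding first) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_dissimilarity(X, n_trees=200, seed=0):
    """Hypothetical helper: unsupervised random-forest dissimilarity matrix."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    # Synthetic contrast data (assumed recipe): permute each column on its own,
    # keeping marginal distributions but breaking dependence between attributes.
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])

    # Train a forest to separate real (1) from synthetic (0) observations.
    data = np.vstack([X, X_synth])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)

    # leaves[i, t] is the leaf reached by real observation i in tree t.
    leaves = forest.apply(X)

    # Proximity: fraction of trees in which two observations share a leaf.
    proximity = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        column = leaves[:, t]
        proximity += column[:, None] == column[None, :]
    proximity /= leaves.shape[1]

    # Map proximity to a dissimilarity in [0, 1]; clip guards rounding error.
    return np.sqrt(np.clip(1.0 - proximity, 0.0, 1.0))


if __name__ == "__main__":
    demo_rng = np.random.default_rng(1)
    # Two artificial groups of "learners" described by three numeric attributes.
    demo = np.vstack([demo_rng.normal(0, 1, (20, 3)),
                      demo_rng.normal(6, 1, (20, 3))])
    D = rf_dissimilarity(demo, n_trees=50)
    print(D.shape)
```

The resulting matrix is symmetric with a zero diagonal, so it could be passed to any clustering method that accepts precomputed dissimilarities. Because tree splits depend only on the ordering of values within an attribute, a measure of this kind is insensitive to attribute scale, which is the property the abstract highlights.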
Authors who publish with this journal agree to the following terms:
- The Author retains copyright in the Work, where the term "Work" shall include all digital objects that may result in subsequent electronic publication or distribution.
- Upon acceptance of the Work, the author shall grant to the Publisher the right of first publication of the Work.
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons 4.0 License (Attribution-Noncommercial-No Derivatives 4.0 International), or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
- Attribution: other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- Noncommercial: other users (including Publisher) may not use this Work for commercial purposes;
- No Derivative Works: other users (including Publisher) may not alter, transform, or build upon this Work, with the understanding that any of the above conditions can be waived with permission from the Author and that where the Work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher's final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher's request, the Author agrees to furnish promptly to Publisher, at the Author's own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
- the Work is the Author's original work;
- the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
- the Work is not pending review or under consideration by another publisher;
- the Work has not previously been published;
- the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
- the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author's breach of the representations and warranties contained in Paragraph 6 above, as well as any claim or proceeding relating to Publisher's use and publication of any content contained in the Work, including third-party content.