Random forest models reveal academic and financial factors outweigh demographics in predicting completion of a year-round veterinary program

Sarah E. Hooper, PhD (https://orcid.org/0000-0003-1150-2500), Department of Biomedical Sciences, School of Veterinary Medicine, Ross University, Basseterre, Saint Kitts and Nevis

Natalie Ragland, DVM (https://orcid.org/0000-0002-7959-4803), Department of Biomedical Sciences, School of Veterinary Medicine, Ross University, Basseterre, Saint Kitts and Nevis; Cooper Medical School, Rowan University, Camden, NJ

Elpida Artemiou, PhD (https://orcid.org/0000-0001-7308-2316), Department of Clinical Sciences, School of Veterinary Medicine, Ross University, Basseterre, Saint Kitts and Nevis; School of Veterinary Medicine, Texas Tech University, Amarillo, TX
Abstract

OBJECTIVE

The purpose of this study was to develop random forest classifier models (a type of supervised machine learning algorithm) that could (1) predict students who will or will not complete the DVM degree requirements and (2) identify the top predictors for academic success and completion of the DVM degree.

METHODS

The study utilized Ross University School of Veterinary Medicine student records from 2013 to 2022. Twenty-four variables encompassing demographic (eg, age, race), academic (eg, grade point average), and financial aid (eg, outstanding balances) data were assessed in 11 cross-validated random forest machine learning models. One model was built assessing all years of data and 10 individual models were developed for each enrollment year to compare how the top predictors of success varied among the years.

RESULTS

Consistently, only academic and financial factors were identified as being features of importance (predictors) in all models. Demographic factors such as race were not important for predicting student success. All models performed very well to excellently based on multiple performance metrics including accuracy, ranging from 96.1% to 99%, and the areas under the receiver operating characteristic curves, ranging from 98.1% to 99.9%.

CONCLUSIONS

The random forest algorithm is a powerful machine learning prediction model that performs well with veterinary student academic records and is customizable such that variables important to each veterinary school’s student population can be assessed.

CLINICAL RELEVANCE

Identifying predictors of success as well as at-risk students is essential for providing targeted curricular interventions to increase retention and achieve timely completion of a DVM degree.


Introduction

Veterinary medical education programs are academically intensive, and some students may experience academic difficulties resulting in delayed graduation, academic dismissal,1–3 or withdrawal due to poor academic performance.4 Specifically, veterinary students report difficulties in managing a heavy workload with complex content and experience additional nonacademic challenges such as financial worries and constraints, struggles with high expectations, declining physical health, challenging peer relationships, isolation, and lack of support.5,6 Evidence strongly indicates that when at-risk students are identified at the early stages of their academic career, targeted interventions can be most effective and have a long-lasting impact on improving academic achievement.7,8 Simple early interventions, such as having disadvantaged high school students read normative accounts from university students about how they adjusted to college and faced similar challenges (ie, not fitting in or making friends), increased full-time enrollment rates throughout students' college pursuits and raised students' grade point averages (GPAs).9

Current research in veterinary medical education focuses on preadmissions data to inform admissions committees on how best to select applicants who will succeed in completing the veterinary curriculum and graduating, reducing the educational debt to the prospective applicant while decreasing the cost to the institution.10 Specifically, the limited number of studies assessing preadmission predictors of veterinary academic success has found that undergraduate cumulative GPA, science GPA,1,2,10,11 Graduate Record Examination (GRE) scores,11–13 and the demographic variable of age14 all correlate positively with veterinary school grades.

In recent years, many veterinary schools have begun to embrace holistic admission review processes, defined by the American Association of Medical Colleges as “mission-aligned admissions or selection processes that take into consideration applicants’ experiences, attributes, and academic metrics as well as the value an applicant would contribute to learning, practice, and teaching.”15 The adoption of holistic admission expands the applicant pool and, as such, the need to gain further appreciation surrounding predictors of success in veterinary academia.

It has been proposed that machine learning (ML), a subfield of AI, is an untapped methodology for college administrators and educators to identify students at risk of failing academically during the preclinical training years.16 While many ML algorithms exist, random forests have been suggested to be particularly useful in higher education, as they combine the outputs of many different decision trees into 1 overall optimal prediction model.16–18

The overarching goal of this study was to employ ML to help identify factors that predict DVM candidates enrolled in a year-round veterinary professional program who are at risk of academic failure. By identifying the factors most important for predicting academic success and failure, future studies can develop and assess targeted interventions, with prospective studies designed to determine whether these interventions successfully help identified at-risk students.19 A combination of accurate prediction models and successful interventions may contribute to the development of a diverse veterinary profession.

The results of a prior study16 that used simulated educational data suggest that the random forest ML model is a robust algorithm for veterinary student data, in part due to being resilient to outlier data and handling nonlinear data. Therefore, we hypothesized that random forest classifier ML models could be used to successfully predict students who will or will not graduate or have graduated from a year-round veterinary program.

The objectives of this study were to (1) build a random forest model that could be used to identify the most important pre- and postadmission determinants for predicting student graduation or withdrawal from the program and (2) assess whether these most important predictors change over a period of 10 years.

Methods

The Ross University School of Veterinary Medicine program overview

The Ross University School of Veterinary Medicine (RUSVM) curriculum is year-round and 40 months in duration. Admission to the program can occur in January, May, or August, and therefore, the year-round academic calendar extends across three 15-week semesters (fall, winter, and summer). Students are required to successfully complete 7 preclinical semesters on the RUSVM campus located on the island of Saint Kitts in the West Indies. Subsequently, students complete 3 clinical semesters at an affiliated AVMA-accredited school.

Selection of data

The RUSVM student data were sourced from an internal database, OutReachIQ, and included all students admitted to RUSVM from January 2013 through May 2022 for whom curricular outcomes were recorded, ensuring data integrity. All demographic data within OutReachIQ were obtained via the standardized application process through the Veterinary Medical College Application Service. The study was approved by the RUSVM Institutional Review Board (No. 22-03-XP).

Preparation and anonymization of data for analysis

All data preparation, model development, and model validation were completed as recommended by Hooper et al16 by use of Spyder (version 5.1.5; Spyder Website Contributors) and Python (version 3.9.12; Python Software Foundation) run within a designated Anaconda Distribution (version 2022.10) environment.

Selection and anonymization of student records

Students enrolled in a combined dual-degree program (eg, DVM–Master of Science) were excluded from the dataset, as the structure of those programs can differ from the DVM curriculum. All DVM student records were assigned a randomized identifier by use of the Python package NumPy (version 1.264.0) prior to anonymizing the dataset. Also, the numerical age at admission was categorized into the following age ranges to maintain the students' confidentiality while concurrently ensuring an even distribution of students among the age categories: 17 to 22, 23 to 24, 25 to 26, 27 to 29, and 30 and over. Four categorical values were established for the country of origin (Canada, US, other, and not reported), taking into consideration the small number of students who identified from countries outside of North America, once again ensuring confidentiality.
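The anonymization steps above can be sketched as follows; the column names and example values are hypothetical illustrations, not the study's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical student records; the column names are illustrative only.
rng = np.random.default_rng(seed=42)
df = pd.DataFrame({"age_at_admission": [21, 23, 26, 28, 34]})

# Assign a randomized identifier to each record before anonymizing the dataset.
df["student_id"] = rng.permutation(len(df))

# Bin the numerical age into the study's age categories.
bins = [0, 22, 24, 26, 29, np.inf]
labels = ["17-22", "23-24", "25-26", "27-29", "30+"]
df["age_range"] = pd.cut(df["age_at_admission"], bins=bins, labels=labels)
```

Categorical age ranges preserve confidentiality while keeping age usable as a predictor.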

Preparation of predictors

States were grouped into categorical variables based on geographic regions as defined by the US Census Bureau.20 All categorical variables were one-hot encoded by use of the scikit-learn Python package (version 1.2.0).21 All raw and model demographic variables are summarized in Supplementary Tables S1 and S2.

Students admitted to RUSVM for spring, summer, or fall semesters were grouped by year of admission. The RUSVM veterinary preparation program (Vet Prep) was assigned 0, and preclinical students were assigned 1 through 7 according to their semester of enrollment. Ross University School of Veterinary Medicine students enrolled in their clinical year were all assigned the numerical value 10, as the database does not differentiate between clinical semesters 8 through 10. Graduate Record Examination scores earned prior to August 2011 were reported on a scale from 200 to 800; these were converted to the new GRE scale22 of 130 to 170. Starting in 2020, the GRE became optional for admission to RUSVM; therefore, this variable was not included in the overall model or the individual models for years 2020 to 2022. A leave of absence (LOA), defined as a minimum of 1 day and a maximum of 14 days out of classes, was counted as 1 absence, and LOAs were summed cumulatively regardless of length. Awarded scholarships were also captured cumulatively, as over 182 different scholarships were represented. All raw and model academic variables are summarized in Supplementary Table S3. Whether students on federal student aid through the Free Application for Federal Student Aid (FAFSA) form had a financial aid or registration hold on their account was designated a binary variable, with no represented by 0 and yes represented by 1. All raw and model financial aid variables are summarized in Supplementary Table S4.
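The ordinal semester coding and binary financial aid flags described above might be implemented along these lines (the column names and example values are assumptions):

```python
import pandas as pd

# Hypothetical records illustrating the coding scheme described in the text.
records = pd.DataFrame({"semester_label": ["Vet Prep", "3", "7", "Clinical"],
                        "fafsa_on_file": ["yes", "no", "yes", "yes"]})

# Vet Prep = 0, preclinical semesters keep their number (1-7),
# and the undifferentiated clinical year = 10.
semester_map = {"Vet Prep": 0, **{str(i): i for i in range(1, 8)}, "Clinical": 10}
records["semester_code"] = records["semester_label"].map(semester_map)

# Binary encoding for FAFSA status: no = 0, yes = 1.
records["fafsa_flag"] = (records["fafsa_on_file"] == "yes").astype(int)
```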

Target variable creation

After the dataset was anonymized and cleaned, a categorical (binary) target variable, "graduated/active student or not graduated," was created for all student records. Accordingly, the numerical value 0 represented a student who enrolled at RUSVM but did not complete the DVM program, while 1 designated a student who either graduated with their DVM degree from RUSVM's accelerated program or was actively enrolled in RUSVM's DVM program. We did not include students on an LOA, since we did not know whether they would return to the DVM program and/or graduate.

We accounted for the severe outcome imbalance (a large percentage of admitted students graduated vs a very low percentage who did not) by applying the synthetic minority oversampling technique to balance the dataset. Correlations for all variables were assessed by use of the Pearson correlation coefficient and the Cramér V. Variable pairs with a Pearson correlation coefficient or Cramér V over 0.7 were reduced to 1 variable. Citizenship and origin variables were not reduced, as our student population has a high percentage of minority students whose citizenship may differ from their country of origin. Upon creation of the target variable, a total of 11 individual datasets were formed: 1 dataset containing all student records for a model assessing all years concurrently and a dataset for each individual year. Each individual dataset was divided into training and testing sets by use of a 70:30 ratio.
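The correlation screening and 70:30 split can be sketched as below on synthetic data; the Cramér V helper is a standard formulation built on SciPy's chi-square test, and the oversampling step (available as `SMOTE` in the imbalanced-learn package) is omitted to keep the sketch self-contained:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split

def cramers_v(x, y):
    """Cramér's V for two categorical series (bias-uncorrected form)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

# Synthetic stand-in data; column names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"gpa": rng.normal(3.0, 0.4, 200)})
df["cum_gpa"] = df["gpa"] + rng.normal(0, 0.01, 200)  # near-duplicate twin
cat = rng.choice(["US", "Canada", "Other"], size=200)
df["citizenship"] = cat
df["origin"] = cat  # identical here, so V is approximately 1
df["target"] = (df["gpa"] > 2.7).astype(int)

# Categorical pairs screened with Cramér's V, numeric pairs with Pearson r;
# drop one member of any pair exceeding the 0.7 threshold.
v = cramers_v(df["citizenship"], df["origin"])
drop = ["cum_gpa"] if df["gpa"].corr(df["cum_gpa"]) > 0.7 else []
X = df.drop(columns=drop + ["target"])

# 70:30 train/test split, stratified on the imbalanced outcome.
X_train, X_test, y_train, y_test = train_test_split(
    X, df["target"], test_size=0.30, stratify=df["target"], random_state=0)
```

In the study, citizenship and origin were deliberately exempted from this reduction despite their association, for the population-specific reason given above.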

Random forest classification models

A total of 11 random forest classifier models were developed by use of the scikit-learn Python package.21 Model performance was k-fold cross-validated (k = 10) by use of the testing dataset. After creation of each of the 11 random forest classification models, the hyperparameters were tuned as previously described.16 The performance of each default-parameter random forest classification model was compared to that of the hyperparameter-tuned model, and the better model was selected for assessing the most important predictor features based on the Gini impurity criterion, the most common method for determining features of importance, as explained in Hooper et al.16 All selected values for each year model are summarized in Supplementary Table S5.
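A minimal sketch of this workflow, comparing a default-parameter random forest against a hyperparameter-tuned one on synthetic data; the grid shown is illustrative only (the study's actual search space is reported in Supplementary Table S5):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the anonymized student records.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Default-parameter model, scored with 10-fold cross-validation.
base = RandomForestClassifier(random_state=0)
base_score = cross_val_score(base, X, y, cv=10).mean()

# Hyperparameter tuning over a small illustrative grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [None, 10]},
                    cv=10)
grid.fit(X, y)

# Keep whichever model cross-validates better, mirroring the comparison above.
best_model = (grid.best_estimator_ if grid.best_score_ >= base_score
              else base.fit(X, y))
```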

Performance metrics

For each model, overall accuracy, sensitivity, specificity, precision, F1 score, and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve were calculated while using k-fold cross-validation (k = 10), as previously described.16 The complete code for building a random forest model, performance metrics, and validation is available.23 An example of simulated student data is also provided within the same repository, as Institutional Review Board restrictions and confidentiality concerns prohibit the public release of the raw student data.
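These metrics can be computed with scikit-learn as sketched below on synthetic data; note that specificity has no dedicated scikit-learn helper and is derived from the confusion matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student dataset.
X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probabilities for the ROC AUC

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "sensitivity": recall_score(y_test, y_pred),  # recall of positive class
    "specificity": tn / (tn + fp),                # true-negative rate
    "precision": precision_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
```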

Results

Selection of data

After eliminating records of dual-degree students and students on an LOA and applying the synthetic minority oversampling technique to balance the dataset, a total of 8,090 student records were utilized for creation and validation of the random forest classification models. Student semester and completed credits were highly correlated (r = 0.98), and preclinical GPA was highly correlated with cumulative GPA (r = 0.99). While random forests can handle correlated data, they cannot handle nearly perfectly correlated data, so only student semester was retained, as all students complete the same preclinical curriculum. We retained only the preclinical cumulative GPA and excluded both the clinical and overall cumulative GPA due to variability in veterinary colleges' grading schemes. Citizenship and country of origin were both retained in the models because these variables had low feature importance values and any inflation of feature importance scores appeared to be minimal.

Years 2013 to 2022 random forest model

The selected hyperparameters for the best performing random forest model utilizing all student data are reported in Supplementary Table S5.

The overall model that incorporated all data from 2013 to 2022 performed very well to excellently for all validation methods employed. Our model had an excellent ROC AUC score of 0.998 and an excellent F1 score of 0.989 (Table 1). Additionally, our model had high accuracy, with 98.9% of predictions correctly classified. The recall (sensitivity) was 99.1%, meaning that 99.1% of the time the model correctly predicted the students who truly graduated from RUSVM or were actively enrolled. The specificity was 98.8%, meaning that 98.8% of the time the model correctly predicted the students who did not graduate from RUSVM. The results of the k-fold cross-validation (k = 10) showed that the model performed well on any given subset of the data rather than only on the original testing data.

Table 1

Performance metric results for the overall random forest model with data from 2013 to 2022 and each individual year model.

Model ROC AUC Accuracy Recall (sensitivity) Specificity Precision F1
2013–2022 0.998 0.989 0.991 0.988 0.987 0.989
2013 0.998 0.992 0.994 0.989 0.990 0.993
2014 0.998 0.995 0.998 0.989 0.990 0.995
2015 0.997 0.982 0.990 0.831 0.981 0.990
2016 0.994 0.992 0.998 0.995 0.995 0.995
2017 0.993 0.983 0.989 0.780 0.984 0.990
2018 0.992 0.977 0.988 0.893 0.977 0.983
2019 0.998 0.978 0.985 0.947 0.989 0.987
2020 0.981 0.960 0.988 0.892 0.959 0.971
2021 0.994 0.961 0.976 0.931 0.966 0.978
2022 0.997 0.963 0.988 0.767 0.969 0.977

AUC = Area under the curve. ROC = Receiver operating characteristic.

Most important features

Academic and financial features contributed the most to model predictions. The top 4 features contributing most to the prediction of our target variable, based on Gini impurity criterion values (reported in parentheses), were student semester (0.282), preclinical cumulative GPA (0.182), failed credits (0.117), and curriculum phase (0.101). All demographic features contributed < 0.012 to the final model. Figure 1 shows the top 10 features. All variables except the GRE were included in the final model, and feature reduction methods were not used to reduce the dimensionality of the dataset, consistent with the goals of this study. Supplementary Table S6 ranks the features in descending order of importance, with the most important feature having the highest Gini importance value and the least important the lowest.
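Gini-based feature importance rankings of the kind reported here can be extracted as follows (synthetic data; real feature names would come from the study's 24 variables):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the feature names are placeholders.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=2)
names = [f"feature_{i}" for i in range(6)]

clf = RandomForestClassifier(random_state=2).fit(X, y)

# Gini (mean decrease in impurity) importances, ranked in descending order;
# scikit-learn normalizes them to sum to 1.
ranking = (pd.Series(clf.feature_importances_, index=names)
           .sort_values(ascending=False))
```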

Figure 1

The top 10 most important features for all students admitted between 2013 and 2022. These features contribute the most to the random forest prediction model when categorizing a student as one who will graduate or one who will not graduate. FAFSA = Free Application for Federal Student Aid. GPA = Grade point average. LoA = Leave of absence.

Citation: Journal of the American Veterinary Medical Association 263, 2; 10.2460/javma.24.08.0501

Demographic data

None of the demographic variables were identified among the top important features of the model. Supplementary Table S6 shows that the age-range variables spanning 23 to 29 years were more important predictors in the model than the ethnicity/race variables.

Academic data

Most withdrawals occurred during the preclinical training phase of the curriculum. Graduated and actively enrolled students had a higher mean GPA compared to those who did not graduate. Students who graduated or were actively enrolled failed fewer credits and had a lower mean LOA count.

Financial aid data

Over half of the students who did not graduate did not have an active FAFSA during their last semester. Eighty-seven percent of those who were actively enrolled had an active FAFSA and a larger average outstanding account balance ($5,585 ± $10,069). Graduated and active students received nearly twice as many scholarships as those who did not graduate.

Individual year random forest models

Preclinical cumulative GPA, student semester, and failed credits were included among the most important features in all yearly random forest models. In years when students had entered their clinical year, curriculum phase was a top contributor, with students more likely to graduate if they entered the clinical phase of the RUSVM curriculum. The GRE was not identified as an important feature in the yearly models that incorporated it as a variable; its contribution was very minor, with the largest value being 0.025 in 2019. Only the 2013 and 2022 models had a demographic variable in the top 5 most important variables. The 2013 model identified 1 demographic variable, race (White), as fifth, with a value of 0.014. The 2022 model identified 1 demographic variable, origin nation (US), as the fifth most important feature, with a value of 0.024. Other most important features identified were financial aid holds, whether students had an active FAFSA on file, the number of LOAs, registration holds, scholarships, and the total outstanding balance. Nearly all the random forest models had only 3 to 4 variables with feature importance values > 0.10. Figure 2 summarizes the top 5 model features by year. Supplementary Tables S7–S16 show the feature importance rankings for all features.

Figure 2

The top 5 most important features for students admitted in each year from 2013 (A) through 2022 (J). For students admitted in 2013 through 2019 (A to G), these features contribute the most to the random forest prediction model when categorizing a student as one who will graduate or one who will not graduate. For students admitted in 2020 through 2022 (H to J), the features contribute most to the random forest prediction models when categorizing a student as one who is actively enrolled or one who has withdrawn (will not graduate).


Discussion

With the ever-changing dynamics of the veterinary student population, there is a need to develop effective methodologies to identify DVM candidates early in their educational pursuits who are at risk of academic difficulties. A recently published ML primer16 suggested that ML algorithms could be used to answer hypothesis-driven questions within the veterinary medical education field. Using this prior study as a launch point, we constructed random forest classification models, a type of ML, to identify potential risk factors of students who suffered academic difficulties while in a veterinary professional program. The study’s overarching goal was to identify variables that should be incorporated into future ML prediction models designed to identify students at risk of academic difficulties. Identifying at-risk students early within their veterinary education enables veterinary colleges to develop targeted interventions for these students prior to academic dismissal.

Our results supported that the yearly random forest classifier models successfully differentiated students who either graduated or were active students in a year-round veterinary medical education training program from those who did not complete the accelerated DVM program. While we used all data available from OutReachIQ, it is important to recognize that the RUSVM student population may not represent the student population at other veterinary colleges, and additional or different variables should be assessed. The work we present here provides a framework that other veterinary colleges can adapt to their specific student populations. For example, approximately 35% of RUSVM applicants admitted each year identify as underrepresented minorities, so we incorporated race and ethnicity into a single model. Other colleges must consider whether incorporating race and ethnicity into a single model or creating separate models for underrepresented minority students is necessary.

Additional sources of data may need to be incorporated based on the curriculum. Ross University School of Veterinary Medicine follows a traditional, discipline-based curriculum, whereas other veterinary colleges deliver an integrated, systems-based approach. Data sources relevant to RUSVM and other veterinary programs may include learning management system data (eg, Canvas), Veterinary Medical College Application Service data, and in-house student surveys. These data sources offer a combination of academic and nonacademic factors.

Our models performed with excellent accuracy and sensitivity and had good to excellent specificity, with the lowest specificity occurring in the 2022 admission year model (Table 1). The lower specificity (ie, the model's ability to correctly classify students who did not graduate) in 2017 and 2022 could be explained by a large portion of student withdrawals being related to transferring to a non–year-round DVM program or to delaying education during the COVID-19 pandemic, rather than to failing multiple courses. A limitation of our study was that OutReachIQ did not include information on why a student withdrew from the program, as the institutional withdrawal form does not require students to provide a reason. We suggest that veterinary training programs consider requiring students to specify their reason for withdrawal on the form or, alternatively, conduct exit interviews. This information could be incorporated into future studies to help identify additional academic and nonacademic factors that influence withdrawals.

The lower specificity in 2022 may also reflect that students were less prepared for a rigorous veterinary school curriculum than students enrolled before the COVID-19 pandemic. The majority of students admitted in 2022 completed their veterinary school prerequisites remotely during the pandemic, and a significant portion of college students reported serious challenges with online learning, with nearly half believing that their academic performance declined.24 The European Union has recognized some of these issues in higher education and identified that "adapting assessment processes to safeguard quality standards and academic integrity in the context of online learning" is an area that needs to be urgently addressed.24

Adjusting the model hyperparameters (Supplementary Table S5) is just 1 approach for improving model performance. We did not include additional nonacademic variables because our variables were consistently collected across all reported years, which was necessary to assess changes over time. Specificity might have been improved by adding additional nonacademic variables, removing less-relevant variables (feature reduction),16 or by modifying variables, such as distinguishing between in-person and online courses. Future studies can address external factors which were not consistent or anticipated (ie, COVID-19) over the years of study.

As outlined in Table 1, all our random forest models performed excellently with very limited feature reduction, which further supports our hypothesis that random forest classifier models can successfully predict students who will be academically successful in an accelerated DVM program and those at risk of not completing it. Feature reduction is commonly performed by data scientists to improve prediction performance while reducing computation time and allowing a better understanding of the data by reducing the complexity of the model.7,25 Our team decided not to reduce the number of features or variables in our models for the following 2 reasons. First, our models had high performance without feature reduction and no evidence of overfitting. Second, it is equally important for veterinary educators to understand which features are of minimal importance. Understanding all potential features, ranging from most to least important, could help inform admissions policies, such as whether GRE scores should be a required prerequisite for admission, and will influence the development of programs designed to minimize academic difficulties and dismissals due to poor performance.8 This is particularly important when ML models identify students prior to the start of the first semester or in the early preclinical semesters, when intervention will be most effective at reducing academic difficulty and withdrawals or dismissals.7,8,26–28

Random forests are a robust prediction model. Prediction models are typically divided into classification models29 and regression models.30 We chose a classification model because we defined a successful student as one who graduated with their DVM degree or was currently enrolled as an active student at RUSVM, rather than using the traditional academic achievement measure of GPA.31 We acknowledge that it is unknown whether high performance in medical or veterinary school reflects how good a doctor a student becomes.32,33 Nor is it known how veterinary student GPA in preclinical courses correlates with a student's ability to successfully perform all the AVMA Council on Education–specified clinical competencies expected of entry-level practitioners.34,35

Our dataset incorporated a combination of pre- and postadmission factors spanning academic, financial, and demographic predictors. These predictors included both categorical variables, such as race or country of origin, and continuous variables, such as cumulative GPA or GRE score. By one-hot encoding the categorical variables, the random forest could use both categorical and continuous variables without imposing an artificial ordinal ranking that could bias results. Random forests are also robust to outliers,36,37 so all student data with complete records could be incorporated. Additionally, random forests do not require feature scaling, meaning we did not need to standardize or normalize the variables prior to creating the models.38 Furthermore, random forests handle high nonlinearity between independent variables, as nonlinear relationships typically do not affect the performance of decision tree models: the splitting of decision nodes is based on threshold tests (yes or no) rather than on the numerical scale of the feature.18,36–39 We also emphasized careful data preparation to ensure the data were accurate and unbiased. The main disadvantages of random forest models are the time required to train them and the computational power and resources required to create the many decision trees that compose each forest.38 Other ML algorithms, such as support vector machines, may perform similarly to random forests while requiring less computational power and merit exploration.

Only academic and financial variables were consistently identified as important predictors in both the overall and yearly random forest models. All models consistently showed that students with higher preclinical cumulative GPAs were more likely to be successful. This is consistent with undergraduate cumulative GPA and science GPA being important predictors of veterinary student success.1,2,10,11 Additionally, as students progressed through the DVM degree program, their likelihood of graduating increased, and those with fewer failed credits were more likely to graduate rather than be dismissed. Two studies,11,12 one published in 2004 and one in 2020, correlated GRE scores with veterinary student success.
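Feature-of-importance rankings like those reported here can be read directly off a fitted scikit-learn random forest. This is a hedged sketch, not the authors' code; the features and data are hypothetical, and impurity-based importances are only one of several importance measures.

```python
# Sketch of extracting impurity-based feature importances from a fitted
# random forest; feature names and data are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({
    "cumulative_gpa": [3.1, 2.4, 3.8, 2.0, 3.5, 2.9],
    "failed_credits": [0, 9, 0, 15, 3, 6],
})
y = [1, 0, 1, 0, 1, 1]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# feature_importances_ sums to 1 across features; sort to rank predictors.
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```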

The GRE was recognized as a variable of only minor importance in our yearly models, and only beginning in 2016. GRE-verbal and GRE-quantitative scores have not increased significantly over time,40 whereas overall GPAs in higher education have risen. It is unclear whether this increase in cumulative GPAs is due to factors such as improved assessment literacy41 or “grade inflation” resulting from faculty needing positive student evaluations for promotion and tenure.41 Future studies should therefore assess the covariates contributing to higher GPAs.

Key financial variables were considered, including outstanding balances, whether students had an active FAFSA on file, and scholarships, because veterinary students report financial worries and constraints as nonacademic challenges negatively impacting their academic performance.5,6 Total outstanding balance and whether a student had a FAFSA on file were both among the top 10 most important features in the overall model. Total outstanding balance was also identified as one of the most important factors in each of the past 7 yearly models. Active students were more likely to carry a larger outstanding balance than students who did not successfully graduate. Interestingly, when the linear relationship between preclinical cumulative GPA and total balance was assessed with the Pearson correlation coefficient, there was no correlation (r = –0.002).
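The Pearson check described above amounts to a one-line computation. The values below are hypothetical toy numbers, not the study's data (which yielded r = –0.002); the sketch only shows the mechanics.

```python
# Sketch of computing a Pearson correlation coefficient between
# preclinical GPA and total outstanding balance (hypothetical values;
# the study reported r = -0.002 on its actual records).
import numpy as np

gpa = [3.1, 2.4, 3.8, 2.0, 3.5, 2.9]
balance = [1200.0, 5400.0, 300.0, 4100.0, 2500.0, 800.0]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson r between the two variables.
r = float(np.corrcoef(gpa, balance)[0, 1])
```

An |r| near 0, as the study observed, indicates no linear relationship between the two variables.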

In conclusion, the study results supported the exploratory use of random forest classifier models to accurately identify both students at risk and those who were academically successful. Future studies should assess the use of random forest models for early prediction and identification of at-risk students in their veterinary education and explore which interventions are most effective at minimizing dismissals due to poor academic performance. Additionally, there is a research gap regarding best practices for using ML to evaluate veterinary student applicants and admission processes. These guidelines should be established before exploring variables such as mental health status and social determinants, to ensure that explicit and implicit bias is avoided. This is particularly important when assessing how such variables might contribute to students’ overall GPA, which was consistently identified as a key factor in student success. Just as important is communicating the findings to the practicing veterinary community, as veterinarians could be invaluable in fostering social belonging and other key intervention strategies for veterinary students.

Supplementary Materials

Supplementary materials are posted online at the journal website: avmajournals.avma.org.

Acknowledgments

None reported.

Disclosures

The authors have nothing to disclose. No AI-assisted technologies were used in the generation of this manuscript.

Funding

Funding for this study was provided by the Ross University School of Veterinary Medicine Research Center for Veterinary Education, Diversity, and Data Analytics and Ross University School of Veterinary Medicine intramural grant 44019-2023.

References

1. Van Vertloo LR, Burzette RG, Danielson JA. Predicting academic difficulty in veterinary medicine: a case-control study. J Vet Med Educ. 2022;49(4):524-530. doi:10.3138/jvme-2021-0034
2. Rush BR, Sanderson MW, Elmore RG. Pre-matriculation indicators of academic difficulty during veterinary school. J Vet Med Educ. 2005;32(4):517-522. doi:10.3138/jvme.32.4.517
3. Raidal SL, Lord J, Hayes LM, Hyams J, Lievaart J. Student selection to a rural veterinary school, I: applicant demographics and predictors of success within the application process. Aust Vet J. 2019;97(6):175-184. doi:10.1111/avj.12820
4. OutReachIQ Database. Ross University School of Veterinary Medicine; version XXXX. Accessed August 1, 2022.
5. Liu AR, van Gelderen IF. A systematic review of mental health-improving interventions in veterinary students. J Vet Med Educ. 2020;47(6):745-758. doi:10.3138/jvme.2018-0012
6. Hafen M Jr, Drake AS, Elmore RG. Predictors of psychological well-being among veterinary medical students. J Vet Med Educ. 2022;50(3):297-304. doi:10.3138/jvme-2021-0133
7. Alyahyan E, Düştegör D. Predicting academic success in higher education: literature review and best practices. Int J Educ Technol High Educ. 2020;17(1):3. doi:10.1186/s41239-020-0177-7
8. Harackiewicz JM, Priniski SJ. Improving student outcomes in higher education: the science of targeted intervention. Annu Rev Psychol. 2018;69:409-435. doi:10.1146/annurev-psych-122216-011725
9. Yeager DS, Walton GM, Brady ST, et al. Teaching a lay theory before college narrows achievement gaps at scale. Proc Natl Acad Sci USA. 2016;113(24):e3341-e3348. doi:10.1073/pnas.1524360113
10. Confer AW, Lorenz MD. Pre-professional institutional influence on predictors of first-year academic performance in a veterinary college. J Vet Med Educ. 1999;26:16-20.
11. Danielson JA, Burzette RG. GRE and undergraduate GPA as predictors of veterinary medical school grade point average, VEA scores and NAVLE scores while accounting for range restriction. Front Vet Sci. 2020;7:576354. doi:10.3389/fvets.2020.576354
12. Powers DE. Validity of Graduate Record Examinations (GRE) general test scores for admissions to colleges of veterinary medicine. J Appl Psychol. 2004;89(2):208-219. doi:10.1037/0021-9010.89.2.208
13. Holladay SD, Gogal RM, Karpen S. Brief communication: predictive value of veterinary student application data for performance in clinical year 4. J Vet Med Educ. 2022;49(6):748-750. doi:10.3138/jvme-2021-0012
14. Green WH, Watson SE, Kennedy GA, Miceli CA, Taboada J. Forecasting veterinary school admission probabilities for undergraduate student profiles. J Vet Med Educ. 2006;33(3):441-446. doi:10.3138/jvme.33.3.441
15. Holistic review. Association of American Medical Colleges. Accessed July 17, 2024. https://www.aamc.org/services/member-capacity-building/holistic-review
16. Hooper SE, Hecker KG, Artemiou E. Using machine learning in veterinary medical education: an introduction for veterinary medicine educators. Vet Sci. 2023;10(9):537. doi:10.3390/vetsci10090537
17. He L, Levine RA, Fan J, Beemer J, Stronach J. Random forest as a predictive analytics alternative to regression in institutional research. Practical Assess Res Eval. 2018;23:1.
18. Spoon K, Beemer J, Whitmer JC, et al. Random forests for evaluating pedagogy and informing personalized learning. J Educ Data Mining. 2016;8(2):20-50. doi:10.5281/zenodo.3554595
19. Doleck T, Lajoie S. Social networking and academic performance: a review. Educ Inf Technol (Dordr). 2018;23(1):435-465. doi:10.1007/s10639-017-9612-3
20. Geographic levels. US Census Bureau. Accessed July 16, 2024. https://www.census.gov/programs-surveys/economic-census/guidance-geographies/levels.html
21. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(85):2825-2830.
22. Old GRE to new GRE comparison. Prosper Overseas. Accessed July 16, 2024. http://www.prosperoverseas.com/old-and-new-gre-comparison-tool
23. Random forest machine learning models for veterinary student recruitment and retention. MicroBatVet. GitHub Inc; 2024. https://github.com/RUSVMCenter4/Random-Forest-Machine-Learning-Models-for-Veterinary-Student-Recruitment-and-Retention. doi:10.5281/zenodo.14051066
24. Farnell T, Skledar Matijević A, Šćukanec Schmidt N; European Commission: Directorate-General for Education, Youth, Sport and Culture; Public Policy and Management Institute. The Impact of COVID-19 on Higher Education: A Review of Emerging Evidence. Publications Office of the European Union; 2021.
25. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40(1):16-28. doi:10.1016/j.compeleceng.2013.11.024
26. Calvet Liñán L, Juan Pérez ÁA. Educational data mining and learning analytics: differences, similarities, and time evolution. Int J Educ Technol High Educ. 2015;12:98-112. doi:10.7238/rusc.v12i3.2515
27. Algarni A. Data mining in education. Int J Adv Comput Sci Appl. 2016;7(6):456-461. doi:10.14569/IJACSA.2016.070659
28. Larusson JA, White B, eds. Learning Analytics: From Research to Practice. Springer; 2014. doi:10.1007/978-1-4614-3305-7
29. Umadevi S, Marseline KSJ. A survey on data mining classification algorithms. In: 2017 International Conference on Signal Processing and Communication. Institute of Electrical and Electronics Engineers; 2017:264-268. doi:10.1109/CSPC.2017.8305851
30. Bragança R, Portela F, Santos M. A regression data mining approach in Lean Production. Concurr Comput. 2019;31(22):e4449. doi:10.1002/cpe.4449
31. Parker JDA, Summerfeldt LJ, Hogan MJ, Majeski SA. Emotional intelligence and academic success: examining the transition from high school to university. Pers Individ Dif. 2004;36(1):163-172. doi:10.1016/S0191-8869(03)00076-X
32. Hudson NPH, Rhind SM, Mellanby RJ, Giannopoulos GM, Dalziel L, Shaw DJ. Success at veterinary school: evaluating the influence of intake variables on year-1 examination performance. J Vet Med Educ. 2020;47(2):218-229. doi:10.3138/jvme.0418-042r
33. Cleland J, Dowell J, McLachlan J, Nicholson S, Patterson F. Identifying Best Practice in the Selection of Medical Students (Literature Review and Interview Survey). General Medical Council; 2012.
34. Chaney KP, Hodgson JL, Banse HE, et al.; AAVMC Council on Outcomes-based Veterinary Education. Competency-Based Veterinary Education: CBVE 2.0 Model. American Association of Veterinary Medical Colleges; 2024.
35. Molgaard LK, Hodgson JL, Bok HGJ, et al.; AAVMC Working Group on Competency-Based Veterinary Education. Competency-Based Veterinary Education: Part 1 - CBVE Framework. Association of American Veterinary Medical Colleges; 2018.
36. Cutler A, Cutler DR, Stevens JR. Random forests. In: Zhang C, Ma Y, eds. Ensemble Machine Learning: Methods and Applications. Springer; 2012:157-175.
37. Horning N. Random forests: an algorithm for image classification and generation of continuous fields data sets. Presented at: The International Conference on Geoinformatics for Spatial Infrastructure Development in Earth and Allied Sciences; December 9-11, 2010; Hanoi, Vietnam. Accessed October 22, 2022. https://www.geoinfo-lab.org/gisideas10/papers/Random%20Forests%20%20An%20algorithm%20for%20image%20classification%20and%20generation%20of.pdf
38. Kumar N. Advantages and disadvantages of random forest algorithm in machine learning. The Professionals Point. February 23, 2019. Accessed July 17, 2024. https://theprofessionalspoint.blogspot.com/2019/02/advantages-and-disadvantages-of-random.html
39. Louppe G. Understanding random forests: from theory to practice. ArXiv. Preprint posted online July 28, 2014. Revised June 3, 2015. Accessed October 22, 2022. doi:10.48550/arXiv.1407.7502
40. Bleske-Rechek A, Browne K. Trends in GRE scores and graduate enrollments by gender and ethnicity. Intelligence. 2014;46:25-34. doi:10.1016/j.intell.2014.05.005
41. Stroebe W. Student evaluations of teaching encourages poor teaching and contributes to grade inflation: a theoretical and empirical analysis. Basic Appl Soc Psych. 2020;42(4):276-294. doi:10.1080/01973533.2020.1756817