One key component of veterinary medical education is the development of surgical skills fundamental to veterinary practice, such as suturing.1,2 New veterinarians are expected to become competent surgeons through practice and feedback during their surgical education.1,2 Performance assessment (PA) is a critical part of surgical education, and the quality of assessment frequently determines the quality of learning in an instructional program.2–7 Veterinary students may not receive significant hands-on surgical instruction or meaningful PA due to limited numbers of surgical faculty and training opportunities.3,4,8–10 Assessment may also be subjective and unreliable due to inconsistencies in the clinical contexts in which skills are demonstrated.5,7,8,11–13 As there are ethical implications to learning surgical skills on patients, Halsted’s classic apprenticeship model of “see one, do one, teach one” may be less relevant to modern surgical education than clinical skills training laboratories, surgical simulations, and simulation-based assessments.8,10,11,13,14
While competency-based veterinary curricula feature standardized training programs that use objective assessments of clinical and technical skills, the ideal method of surgical skills assessment of veterinary students is not known.2,5,8,10,12,15,16 Three PAs commonly used in medical and veterinary education include the Objective Structured Clinical Examination (OSCE),2,16–22 the Objective Structured Assessment of Technical Skills (OSATS),2,3,5,11,23,24 and the Global Rating Scale (GRS).2,3,5,8,11,18,22,23 While checklists for surgical OSCE stations assess students’ abilities to perform individual steps of a procedure, the OSATS and GRS assess overall competence.12 Although many PAs are meant to be objective, the subjective input of raters introduces variance, rater errors, and biases that may weaken the assessment of competence.2,7,12,13,18,21
In contrast, the assessment of surgical skills by artificial intelligence (AI) and machine learning models trained on visual and physiologic data has the potential to standardize evaluations, reduce subjectivity and supervision required by experts, and provide more objective, consistent, and immediate feedback to students.7,12,25,26 Surface electromyographical (sEMG) data and other motion data of the hands and arms have been collected during suturing and used to classify surgical gestures using machine learning in human medicine.7,25–29 To date, however, the study of AI-based assessment of surgical skills of veterinary students using data appropriate for input to machine learning algorithms has not been reported.
The objectives of this study were to evaluate second-year veterinary students’ performance of 4 surgical tasks 1, 3, and 5 weeks after a surgical skills course using task-specific OSCE-style and OSATS checklists and surgical GRS rubrics scored by expert raters and to compare the findings of each. To determine the feasibility of using motion data input to machine learning algorithms for the assessment of suturing skills, we collected sEMG data from students. Our goals were to compare sEMG signals to scores from the most detailed assessment, the OSCE-style checklist, and to input features extracted from the sEMG data to machine learning algorithms to train a classification model to distinguish the surgical tasks. We hypothesized that the collection of sEMG data and development of the classification model would be feasible and that EMG signals would correlate with OSCE-style checklist scores, providing support for further study of AI-facilitated assessment of surgical skills in veterinary education.
Materials and Methods
Study participants
This study was approved by the Institutional Review Board of the University of Illinois Urbana-Champaign (protocol number 21177). Sixteen second-year veterinary students were randomly selected from a pool of volunteers. Informed consent was obtained, and 3 suturing sessions were completed 1, 3, and 5 weeks after a surgical skills training course was administered as part of the University of Illinois’ second-year veterinary curriculum.
Procedures
Students completed pre-participation and post-participation questionnaires using a 5-point Likert scale (Supplementary Material S1). Before each session, students responded to questions regarding suturing experience, time spent practicing, and most practiced skills. After each session, students responded to questions regarding the experimental setup and design.
To begin, 1 investigator (H.S.) placed a wireless, multi-sensor EMG armband (Myo Gesture Control Armband, Thalmic Labs) on each student's dominant forearm ∼5 cm distal to the olecranon, with sensor 2 positioned over the flexor carpi radialis muscle for consistency. The 7 other sensors were spaced evenly around the forearm to sense muscle activity and record sEMG data (Figure 1). Students donned non-sterile blue surgical gowns and nitrile gloves (an orange glove on the dominant hand and a green glove on the non-dominant hand) to aid in distinguishing hand movements (Figure 2). Seated at a surgical workstation, students received written instructions and viewed custom videos outlining each task. Tasks were selected based on suturing skills students might be asked to demonstrate at a surgical station of a summative examination.
Students were given the choice of forceps, scissors, and needle drivers along with 3-0 poliglecaprone-25 suture on a 26-mm, ½-circle taper needle (Monocryl, Ethicon Inc.). A synthetic tissue model created at the University of Illinois was also provided (Figure 2). The two-layer silicone model was designed to mimic the dermis and subcutis of canine skin after extensive prototyping in our institution's clinical skills learning center and feedback from 6 faculty and 4 resident small animal surgeons. The final product was fabricated in a 2-step process. The dermis (Body Double™ SILK; Smooth-On, Inc.) was spread 3.2 mm thick onto a textured vinyl surface and allowed to cure for 18 hours. The subcutis (Ecoflex™ 00-30; Smooth-On, Inc.) was then poured over the dermis and allowed to cure for 4 hours. Concentrated silicone pigments (Silc Pig™; Smooth-On, Inc.) in black, red, and white (dermis) and yellow (subcutis) were used to obtain a pale pink dermis with an opaque but mildly translucent, yellow-white subcutis. The resulting two-layer model delaminated after incision similarly to incised canine skin. The subcutis was sufficiently translucent to permit evaluation of the suture pattern from both the dermal and subcutis sides of the model.
Students were asked to select appropriate instruments, hold them with proper technique, and complete 4 suturing tasks using the tissue model: (1) passing a needle across a mock incision; (2) tying a simple interrupted knot; (3) performing a simple continuous pattern; and (4) burying a simple interrupted knot. At the week 1 session, students performed the first 3 tasks; the fourth, more difficult task was added for weeks 3 and 5. A digital camera (PTZOptics Move 4K SDI/HDMI/USB/IP pan, tilt, and zoom camera, B & H Foto & Electronics Corp.) positioned on a tripod over the right shoulder of each participant recorded students’ hands and forearms and performance of each task. Investigators (J.K., H.P.) observed students remotely via the digital camera and, using a 2-way speaker system (Jabra Speak 510+ UC wireless Bluetooth/USB speaker, GN Audio), instructed students to begin each task. Between tasks, students were instructed to return to a “neutral” position, a reference pose used to normalize the sEMG data (Figure 2). Task completion times were recorded in seconds. Investigators (J.K., H.P.) provided each student with structured, written feedback to guide deliberate practice.
Rater assessment of digital recordings of student performance
Videos of each task were randomized by student and week before scoring by 4 expert raters (3 board-certified surgeons and 1 residency-trained surgeon; C.M., H.G., K.P., S.T.) who were not involved in study design and were blinded to student identification and week. Raters received detailed written instructions for scoring using task-specific OSCE-style and OSATS checklists and a surgical GRS rubric (Supplementary Material S2) and reviewed the written instructions provided to students. Raters then viewed the digital recordings and assessed each performance by applying each rubric. The OSCE-style checklist detailed crucial steps for each task formulated into “Yes/No” questions.17,30 The OSATS checklist was a list of task-specific operative competencies with 3 scoring options: Done, Done Incorrectly, and Not Done.11,30,31 The surgical GRS was based on the sum of 6 dimensions of suturing: respect for tissue, time and motion, instrument handling, the flow of operation, knowledge of the specific procedure, and quality of the final product.11,31 The OSCE-style checklist required evaluation of each individual task, while the OSATS and GRS required collective evaluation of all tasks performed by each student each week. Raters completed a questionnaire regarding demographic information and an anonymous questionnaire designed to assess for signs of rater error or bias (Supplementary Material S3).
sEMG data acquisition
During suturing, data were collected from the armband by 8 surface EMG sensors and one 9-axis (3-axis accelerometer, 3-axis gyroscope, and 3-axis magnetometer) inertial measurement unit (IMU) sensor. Each EMG sensor used 3 dry, medical-grade stainless steel electrodes for reference mass-based surface measurement at a fixed sampling rate of 200 Hz.32 The sEMG data were down-sampled to 50 Hz to facilitate synchronization with data from the IMU sensor, which had a fixed sampling rate of 50 Hz. With an expandable diameter of 19 to 34 cm and a weight of 93 g, the armband was highly flexible and expected to be comfortable and unobtrusive.32,33 Sensors acquired raw sEMG data of superficial lower arm muscle activity during task-related wrist, hand, and finger movements and transferred data using a Bluetooth master client connection (Bluetooth short-range wireless technology, Bluetooth Special Interest Group).32–34 The complete Materials and Methods for sEMG data acquisition and processing are provided elsewhere (Supplementary Material S4).
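The full acquisition and processing pipeline is described in Supplementary Material S4. As a minimal illustration of the synchronization step described above, the Python sketch below down-samples an 8-channel sEMG stream from 200 Hz to 50 Hz by block averaging; the block-averaging approach and array shapes are illustrative assumptions, not the study's exact implementation.

```python
import numpy as np

def downsample_emg(emg_200hz: np.ndarray, factor: int = 4) -> np.ndarray:
    """Block-average raw sEMG from 200 Hz to 50 Hz (factor of 4) so that each
    retained sample aligns with one 50 Hz IMU sample.

    emg_200hz: array of shape (n_samples, 8), one column per armband sensor.
    """
    n = (emg_200hz.shape[0] // factor) * factor  # drop any trailing partial block
    return emg_200hz[:n].reshape(-1, factor, emg_200hz.shape[1]).mean(axis=1)

# Example: 10 seconds of simulated 8-channel sEMG at 200 Hz -> 500 samples at 50 Hz
raw = np.random.randn(2000, 8)
print(downsample_emg(raw).shape)  # (500, 8)
```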
Machine learning
Four time-domain (TD) features commonly used to characterize sEMG signals, the Mean Absolute Value (MAV), Root Mean Square (RMS), Variance (VAR), and Waveform Length (WL), were calculated for all sEMG signals over a sliding window of 8 sampling points (0.16 seconds), with no overlap between windows.35,36 The complete Materials and Methods for machine learning data acquisition and processing are provided elsewhere (Supplementary Material S4). A flow diagram outlining the processing and treatment of the sEMG data is provided elsewhere (Supplementary Figure S1).
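The sketch below computes the 4 TD features as they are conventionally defined, per non-overlapping 8-sample (0.16-second) window for a single channel; the study's exact windowing and normalization choices are detailed in Supplementary Material S4 and may differ.

```python
import numpy as np

def td_features(channel: np.ndarray, win: int = 8) -> dict:
    """Compute MAV, RMS, VAR, and WL for one sEMG channel over
    non-overlapping 8-sample windows (0.16 s at 50 Hz)."""
    n = (len(channel) // win) * win
    w = channel[:n].reshape(-1, win)                        # (n_windows, win)
    return {
        "MAV": np.mean(np.abs(w), axis=1),                  # mean absolute value
        "RMS": np.sqrt(np.mean(w ** 2, axis=1)),            # root mean square
        "VAR": np.var(w, axis=1, ddof=1),                   # sample variance
        "WL": np.sum(np.abs(np.diff(w, axis=1)), axis=1),   # waveform length
    }

features = td_features(np.random.randn(500))                # e.g., 10 s at 50 Hz
print({name: values.shape for name, values in features.items()})  # each (62,)
```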
Statistical analysis
Statistical analyses were performed within the R computing environment using the base ‘stats’ package and the ‘rstatix’ and ‘irr’ packages.37–39 Data were visualized using the ‘ggplot2’ package.40 Normality was assessed using the Shapiro-Wilk test. As scores by week and rater across all students were generally normally distributed, a 2-way repeated measures analysis of variance (ANOVA) was used to assess for changes in score over time and by rater and for interactions between time and rater. Pairwise comparisons with Bonferroni correction were performed to establish scoring differences between raters. As different total OSCE-style checklist scores were possible for week 1 versus weeks 3 and 5, scores were analyzed as proportion earned of the maximum total possible points rather than raw scores. Inter-rater reliability (IRR) for each PA was assessed using the intraclass correlation coefficient (ICC). ICC values < 0.5 were considered indicative of poor reliability, 0.5–0.75 indicative of moderate reliability, 0.75–0.9 indicative of good reliability, and >0.9 indicative of excellent reliability.41 Inter-rater agreement (IRA) was assessed as percentages with various tolerance limits applied. Reliability and agreement were assessed with scores reported as proportions of the maximum score and with scores ranked by decile. Spearman rank correlation tests were performed to correlate individual sensor sEMG signals with task-specific OSCE checklist scores (averaged across raters) and task completion times. Associations between student questionnaire results and PA scores at specific time points (averaged across all raters) were assessed using 1-way ANOVA. Changes in individual sensor sEMG signal over time and by task were assessed by 2-way repeated measures ANOVA followed by pairwise comparisons with Bonferroni correction. Task completion times (in seconds) were assessed using 1-way repeated-measures ANOVA. Summary descriptive statistics are provided for questionnaire data not sufficiently varied for meaningful statistical analyses. Statistical significance for all tests was set at P < .05 with Bonferroni correction for multiple testing as appropriate.
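The analyses above were performed in R. For a concrete illustration, the Python sketch below reproduces two of the summary computations described: the Koo and Li ICC interpretation bands and percent inter-rater agreement within a tolerance. It assumes agreement on a performance requires every pair of raters to fall within the tolerance, and the scores shown are hypothetical.

```python
import numpy as np
from itertools import combinations

def icc_band(icc: float) -> str:
    """Koo and Li (2016) interpretation bands used in this study."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"

def percent_agreement(scores: np.ndarray, tol: float) -> float:
    """Percent of performances on which all raters agree within `tol`
    (e.g., 0.1 of the maximum proportional score, or 1 decile).

    scores: array of shape (n_performances, n_raters)."""
    n_raters = scores.shape[1]
    agree = [
        all(abs(row[i] - row[j]) <= tol for i, j in combinations(range(n_raters), 2))
        for row in scores
    ]
    return 100 * float(np.mean(agree))

# Hypothetical proportional scores from 4 raters for 5 performances
s = np.array([[0.80, 0.75, 0.65, 0.70],
              [0.90, 0.85, 0.80, 0.88],
              [0.60, 0.72, 0.55, 0.64],
              [0.95, 0.90, 0.93, 0.97],
              [0.70, 0.68, 0.79, 0.74]])
print(icc_band(0.48), percent_agreement(s, tol=0.1))  # poor 40.0
```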
Results
Study cohort and pre- and post-participation questionnaire responses
Data from all 16 enrolled students were included. There were 3 male and 13 female students. Pre- and post-participation questionnaire results are summarized (Supplementary Tables S1 and S2). No pre-participation variable was associated with any performance score at week 1 (P > .18). While students reported practicing on average 1–10 hours/week, there was no association between practice time and performance scores in subsequent weeks (P > .1). Although students reported no discomfort with the experimental canopy or armband, students variably reported anxiety associated with suturing. Post-participation questionnaire variables were not associated with any other variable or outcome measurement.
Performance assessment
OSCE and OSATS checklist and GRS scores for all raters for all students, tasks, and weeks were included in statistical analyses. Scores were assessed as proportions earned of the maximum available points since the maximum differed by scale, task, and week. Total OSCE and OSATS checklist scores and GRS scores by rater and week are summarized (Supplementary Table S3; Supplementary Figure S2). For each PA, total scores were different between raters (P < .001 for all 3 PAs) but did not change over time (P > .23). Complete ANOVA models are presented (Supplementary Tables S4–S6). For total OSCE and OSATS checklist scores over all weeks, rater 1 scored higher than raters 3 and 4, and rater 2 scored higher than rater 3 (P < .004 for both rubrics). For GRS, raters 2 and 4 scored higher than raters 1 and 3 (P < .0002).
Descriptive data for task completion times are reported (Supplementary Table S7; Supplementary Figure S3). Completion times were different for all tasks (P < .0001). Task 3 was the longest followed by task 4, task 2, and task 1. Time to completion did not change over the study for any task (P > .51).
Electromyography
Data for all 8 sensors by task and week are summarized (Supplementary Table S8). Regarding the effects of task and time on sEMG over all weeks, some sensors varied by task, but time did not affect sEMG signal for any sensor (P > .05). ANOVA models are presented elsewhere (Supplementary Tables S9–S16). Sensor 5 had a lower signal for task 1 than task 3 (P = .02), sensor 6 had a lower signal for task 1 than tasks 2, 3, and 4 (P < .01), and sensor 7 had a lower signal for task 1 than tasks 2, 3, and 4 (P < .04). Sensors 2 and 5 had lower signals at week 3 than week 5 (P = .01, P = .009, respectively). Lower sensor 5 signals correlated with faster completion times for task 2 (ρ = 0.29, P = .04), and higher sensor 6 signals correlated with faster completion times for task 3 (ρ = −0.33, P = .02; Supplementary Table S17).
Correlations between the mean sEMG signal of each sensor and OSCE checklist score averaged over all raters were calculated for each task (Supplementary Table S18). Since sEMG signals did not change over time within a task, all weeks were analyzed simultaneously. For task 1, lower mean sEMG signals at sensors 6 and 8 correlated with higher OSCE checklist scores (ρ = −0.30, P = .04; ρ = −0.34, P = .02, respectively). For task 4, sensors 3 and 4 showed a similar negative correlation with mean OSCE score that did not reach statistical significance (ρ = −0.32, P = .08; ρ = −0.33, P = .07, respectively).
Inter-rater reliability and agreement
IRR and IRA were calculated using OSCE and OSATS checklist scores and GRS scores of all raters across all students and weeks with scores reported as proportions of the maximum score and with scores ranked by decile. Reliability was moderate to poor for the OSCE and OSATS checklists and poor for the GRS (Table 1). IRA was calculated using varying tolerances and was greater for the OSCE and OSATS checklists than for the GRS (Table 2). Excellent agreement was achieved for the OSCE and OSATS checklists only when a tolerance of 20 percentage points (or 2 deciles) was allowed; even with these allowances, IRA for the GRS was poor.
Table 1—Inter-rater reliability for the OSCE, OSATS, and GRS rubrics as assessed by intraclass correlation coefficient (ICC).

| Performance assessment | ICC for proportional scores | 95% CI | ICC for scores ranked by decile | 95% CI |
|---|---|---|---|---|
| OSCE | 0.48 | 0.32–0.63 | 0.41 | 0.26–0.57 |
| OSATS | 0.43 | 0.28–0.58 | 0.37 | 0.22–0.53 |
| GRS | 0.28 | 0.11–0.46 | 0.27 | 0.10–0.45 |

ICCs were calculated using scores of all raters across all students and weeks, with scores reported as a proportion of the maximum score and as ranked by decile. Reliability was moderate to poor for the OSCE and OSATS and poor for the GRS using both methods. CI = Confidence interval. GRS = Global Rating Scale. OSATS = Objective Structured Assessment of Technical Skills. OSCE = Objective Structured Clinical Examination.
Table 2—Inter-rater agreement (IRA) for the OSCE, OSATS, and GRS rubrics calculated as percent agreement using varying tolerances.

| Performance assessment | Proportional scores: tolerance 0 | Proportional scores: tolerance 0.1 | Proportional scores: tolerance 0.2 | Scores ranked by decile: tolerance 0 | Scores ranked by decile: tolerance 1 | Scores ranked by decile: tolerance 2 |
|---|---|---|---|---|---|---|
| OSCE | 0 | 62.5 | 97.9 | 17.0 | 85.1 | 97.9 |
| OSATS | 2.1 | 68.8 | 97.9 | 31.2 | 89.6 | 100 |
| GRS | 0 | 2.1 | 19.1 | 2.1 | 6.4 | 34.0 |

Scores were reported as a proportion of the maximum score and as ranked by decile. Excellent agreement was only achieved for the OSCE and OSATS when a tolerance of 20 percentage points or 2 deciles was allowed; even with these allowances, IRA for the GRS was poor. See Table 1 for key.
Rater questionnaires
Three raters were small animal surgical faculty members at colleges of veterinary medicine at the time of the study; 1 rater practiced as a locum tenens surgeon. All 4 raters had completed rotating internships and residencies in small animal surgery at academic institutions in the continental United States within the previous 6 years. Three raters were certified by the American College of Veterinary Surgeons.
Responses to an anonymous questionnaire designed to assess for rater errors and biases are summarized (Supplementary Table S19). Although all raters found the OSCE and OSATS checklist and GRS rubrics relevant, raters differed on whether the rubrics were easy to understand and apply to student performance. Raters had widely differing experience using OSCE and OSATS checklist assessments, and no rater had experience using a GRS. Three raters reported scoring most students as average, rating them as more or less adept only when a performance was exceptionally strong or weak. Raters reported differing tendencies to be lenient or strict in evaluating students, and 1 rater reported becoming stricter over time. Although 3 raters reported scoring students against a gold standard, 2 raters also reported scoring students’ surgical performances against those of their peers. Raters differed on whether the performance of 1 task affected their evaluation of other tasks. Although 3 raters reported being equally alert and attentive to all performances, 2 raters also reported being tired and easily distracted when evaluating performances later in a scoring session. Raters also differed on whether their mood was favorable or unfavorable during the evaluations.
Machine learning results
Classification results are presented (Table 3). The chosen features extracted from the sEMG data and input to the CNN-LSTM classifier (MAV, RMS, VAR, and WL) permitted accurate classification of performances as task 1, 2, 3, or 4. The difference between the lowest and highest accuracy values across the extracted features was less than 2%, indicating all features were similarly accurate in classifying the suturing tasks. Confusion matrices for the TD features are presented (Figure 3). Precision, recall, and F1 score were high for all TD features, indicating the classification model yielded few false-positive or false-negative results. The uniform distribution of the off-diagonal elements in the confusion matrices confirmed that the model had not been overfitted to any of the classes. While tasks were accurately differentiated using sEMG features, features extracted from the sEMG data did not differ sufficiently between students and weeks to compare surgical proficiency or characterize changes in suturing skills over time.
Table 3—Performance metrics for 4 time-domain features of the sEMG data using a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) classifier model.

| Feature | Accuracy (%) | Precision (%) | Recall (sensitivity) (%) | F1 score (%) |
|---|---|---|---|---|
| MAV | 89.2 | 90.3 | 88 | 88.7 |
| RMS | 89.2 | 90.5 | 88.5 | 89.2 |
| VAR | 88.6 | 90.5 | 86.5 | 87.4 |
| WL | 90.3 | 90.8 | 89.3 | 89.9 |

The difference between the lowest and highest accuracy values for all features is less than 2%, indicating all features were similarly accurate in classifying the surgical tasks. sEMG = Surface electromyography. MAV = Mean absolute value. RMS = Root mean square. VAR = Variance. WL = Waveform length.
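The exact network architecture, hyperparameters, and training procedure are given in Supplementary Material S4 and are not reproduced here. The sketch below is one plausible CNN-LSTM configuration (illustrative layer sizes, Keras API) that accepts zero-padded sequences of per-window, 8-channel feature values, one TD feature at a time, and outputs a 4-way task prediction scored with the kinds of metrics reported in Table 3.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.metrics import classification_report, confusion_matrix

def build_cnn_lstm(max_windows: int, n_channels: int = 8, n_tasks: int = 4) -> tf.keras.Model:
    """Illustrative CNN-LSTM: 1-D convolution over the window axis, an LSTM to
    summarize the sequence, and a softmax over the 4 suturing tasks."""
    model = models.Sequential([
        layers.Input(shape=(max_windows, n_channels)),       # zero-padded feature sequences
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(64),
        layers.Dense(n_tasks, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy data: 64 trials, each padded to 200 windows of one TD feature per sensor
X = np.random.rand(64, 200, 8).astype("float32")
y = np.random.randint(0, 4, size=64)                          # task labels 0-3
model = build_cnn_lstm(max_windows=200)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)

y_pred = model.predict(X, verbose=0).argmax(axis=1)
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred, zero_division=0))      # precision, recall, F1 per task
```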
Discussion
In this study performed at a single institution, OSCE-style checklist, OSATS checklist, and surgical GRS performance assessments were unreliable for the evaluation of suturing skills of second-year veterinary students. Students’ scores varied widely despite evaluation by a cohort of similarly expert raters instructed on the assessment rubrics. Whether OSCE or OSATS checklist scores were categorized by proportion of maximum points available or ranked by decile, unacceptably high levels of tolerance (20 percentage points or 2 deciles) were required to achieve excellent agreement between raters. Even with high levels of tolerance, IRA was poor for the GRS. Raters reported widely differing moods, endurance, and approaches to scoring during evaluation that may have contributed to low reliability.9,42 Students’ prior experience with suturing did not affect performance, and students’ scores and times to task completion did not improve despite most students practicing between testing time points. Our findings indicate that the collection of sEMG data was feasible and well tolerated by students, but sEMG signals were only correlated with OSCE checklist scores and speed of task completion for task 1, the simplest task. Although sEMG data were not correlated with performance scores assessed by expert raters, sEMG data were able to differentiate among tasks by routine statistical analysis and to classify tasks when input to machine learning neural network algorithms.29,43 As students’ experience of the experimental design and anxiety associated with suturing had little to no effect on performance, the collection of EMG data and assessment by machine learning models could provide an objective assessment of suturing skills of veterinary students. Further investigation of this application is warranted.
Performance assessments are skills evaluations wherein raters observe students perform tasks and provide evaluations based on criterion-referencing, or previously determined standards.4,12,38 The OSCE is widely used in medical and veterinary education to test a variety of competencies including surgical skills.16,17 At surgical OSCE stations, detailed checklists of standardized tasks and structured scoresheets prompt raters to use a stepwise approach to assess objectified skills such as suturing.9,15,17 Although our OSCE checklist was designed by surgical experts, delineated expectations for each standardized task, and described assessment criteria,8,12 reliability and agreement among raters were insufficient to recommend its use for low-stakes formative or high-stakes summative assessments.2,8,9,19,20 The OSCE-style checklist did have the highest reliability of all assessments in this study. This may be because checklists reward attention to detail and adherence to technique, and the OSCE is reportedly most useful early in surgical programs to assess novice learners who tend to adopt a stepwise approach to problem-solving.19,22
Although thorough, OSCE checklists may overly fragment skills, preventing global assessment.3,19,20 By contrast, GRSs consist of principles of operative skill common to all surgical procedures scored on a 5-point Likert scale anchored by descriptors at the middle and extreme points to guide assessment.2,3,19,22,32 However, the GRS had the lowest reliability of all PAs in this study. This may not be unexpected, as a recent systematic review indicated GRSs are not sufficiently reliable for procedure-specific summative assessment in high-stakes examinations.5,44,45 While descriptors in Likert scales reportedly enable reliable scoring without extensive rater training,2,11,20 the Likert scale in GRSs has also been criticized.5 Describing only certain points along the scale may make intermediate points less likely to be chosen by raters, increasing subjectivity and decreasing score reliability.5 Additionally, descriptors often incorporate vague terms open to interpretation by raters, further weakening GRS reliability.5 The OSATS was developed to improve reliability by use of operation-specific checklists of key procedural steps and “Done, Not Done, and Done Incorrectly” scoring options.2,23,32 While our OSATS checklist followed this template, it was also insufficiently reliable to recommend its use in evaluating suturing skills of veterinary students. Once a gold standard assessment, the OSATS has also failed a recent systematic review, lacking evidence to support its use in high-stakes examinations or in the operating room.5,45
Although the OSCE and OSATS checklists and surgical GRS offer greater objectivity and contextual consistency than clinical evaluations,2,8,11,12,20 conventional PAs allow for subjectivity, rater errors, and biases.12 Interrater reliability is a measure of rater consistency in distinguishing skills based on predetermined criteria.3,4,12,42,46 Interrater agreement measures the extent to which different raters assign the same absolute scores to the same students for the same tasks. While a high IRA clearly indicates student ability, a high IRR indicates consistency in how raters assign priority and value in scoring.4 Rater factors such as fatigue, mood or affect, and biases are the most important contributors to overall examination error,14,42,44 and rater factors likely contributed to the lack of reliability and agreement associated with PAs in this study. We utilized video recordings of performances to enable the blinding of raters to student identity and to improve flexibility, allowing multiple raters to review all performances.3,6,7,9,21,29 Although starting or stopping recordings, taking breaks, and replaying performances can reduce rater fatigue and prevent inattentional blindness,3,6,9 lack of endurance, tiredness, distraction, and inattentiveness were reported by raters in our study.
Common biases or errors affecting evaluators include those of central tendency, anchoring, leniency and severity, and contrast.8,42,44 While we attempted to prevent biases and errors by randomizing video assignments and blinding raters to student demographics, some raters reported signs of central tendency bias, scoring most items as average, and contrast bias, comparing students to their peers. Additionally, raters’ experience, mood, and opinions of the quality and relevance of the OSCE or OSATS checklist or surgical GRS rubrics may affect reliability.19,42 Although all raters agreed all rubrics were relevant, experience with each PA varied widely. No rater had previously used a GRS, the assessment with the least reliability in this study, suggesting experience with a PA contributes to rater reliability and agreement.44 Two raters also confessed to having an unfavorable mood when evaluating surgical performance. Reliability may have been improved in this study if raters were informed of the potential impact of cognitive biases as well as errors related to mood and attentiveness.14
IRR can be strengthened with rater training,12 and an ICC >0.8 is desirable, especially for high-stakes evaluations.3,5 Although our raters received written instructions on how to utilize each rubric, the ICC for each of our assessments was <0.5. While surgical PAs are considered criterion-referenced examinations that compare skills to a predetermined standard, raters must agree on the standard.4,9 Moderation involves discussion among raters intended to achieve a common understanding of scoring criteria and has been reported to improve the reliability of rubrics and rater assessments, providing students with accurate, consistent, and constructive scores.9 It is possible the assessments in the present study would have been more reliable if raters had discussed and agreed on the PA criteria and how to interpret and apply them before examination and scoring.9,10
One objective measure that has been correlated with markers of surgical proficiency is time to task completion.13 Although the times to complete each task corresponded to the difficulty of the task in our study, times to completion did not improve over time. Although most students reported practicing flow and motion, many students reported focusing more practice on other skills such as instrument and tissue handling and suture placement. Additionally, better assessment scores were correlated with both faster and slower completion times, and faster completion times were correlated to lower sEMG signal magnitudes for task 2 but higher sEMG signal magnitudes for task 3. While assessing time to completion may help infer skill level, completing a task quickly does not convey proficiency if errors are present,13 and it is not yet known whether certain surgical tasks require more or less time or muscle activity to be performed competently.28 The concept of a speed-accuracy trade-off is well established in the psychology literature; novices perform complex tasks more accurately when they take time to concentrate and think through the task. Conversely, experts actually perform complex tasks less accurately when forced to take more time, appearing to “get in their own way.” When experts perform tasks more quickly, muscle memory is thought to contribute to greater accuracy and precision in performance.47
Precisely performed surgical tasks can be segmented into component gestures such as tissue grasping, suturing, needle insertion, and suture grasping,28 and automated recognition by machine learning systems of these gestures can provide objective evaluations of surgical performance with consistent feedback.7,28 Extracted features of surgical videos, electroencephalographic data, kinematics, and EMG data have all been used as input to machine learning algorithms to classify surgical hand gestures,7,10,25,28,48 and lower sEMG signals were associated with greater surgical proficiency in one study.27 In the present study, we determined several time-domain features of EMG data including amplitude, variance, and length of the signal were sufficiently robust to train our classifier using machine learning.49 The classification model recognized and classified each surgical performance as task 1, 2, 3, or 4 with high accuracy, precision, and sensitivity across students, tasks, and weeks. However, surface EMG signals were not sufficiently different between students or weeks to compare skill levels or characterize the progression of suturing skills over time, and we did not aim to assess absolute surgical proficiency with EMG. A study of sEMG data from expert surgeons is needed to determine how much time and muscle activity are indicative of surgical proficiency for different surgical tasks and gestures.
Although in this feasibility study, we could not directly compare the reliability of expert rater assessment of each student’s suturing proficiency to the reliability of machine learning assessment, the sEMG data obtained from our students indicate kinematic and other physiologic data hold promise for improving assessments of technical skill through AI.28,29,44 Machine learning algorithms have aided or outperformed experts in the detection of breast cancer lesions, lymph node metastasis, pneumonia in people, reticulocytes in cats, and pathologic criteria of malignancy in dogs.38,50,51 Future investigations will be focused on gathering larger datasets from novice and expert surgeons, identifying and extracting features of the sEMG data that distinguish novice and expert surgical performance, and using these features to identify algorithms that classify surgical performance on an sEMG-derived scale of proficiency.27
Limitations of this study include a small sample size of students from a single institution and a limited number of expert raters. Two raters were from the same institution as the primary investigators and students, and a leniency bias for some scores cannot be ruled out. Rater training did not include discussion and agreement on scoring criteria in advance of rater evaluations. By using a remote-controlled, pan, tilt, and zoom high-resolution digital video camera placed in the same location over the right shoulder of every student, the investigators ensured that an unobstructed, close view of the students’ hands, their overall performance, and the entire tissue model was consistently and clearly visible in all videos. Raters also could zoom in or out when viewing the videos on their own computers or devices. However, the field of view was pre-determined by the investigators, and this may have posed a limitation for some raters, although no rater expressed this concern. The OSCE-style checklist, OSATS checklist, and surgical GRS were not previously validated in this student population but were directly derived from studies that demonstrated content and construct validity.11,17,30,31 The EMG armband, synthetic tissue model, and student and rater questionnaires have not been evaluated for validity or fidelity, and the use of machine learning to assess suturing skill requires suitable data to train available models such as CNN and LSTM models. While our data were of sufficient quality and quantity to train our classification model to differentiate and identify tasks accurately and precisely, one way the algorithm learns to classify is through the number of padded zeros, which in our study represented time to task completion, rather than purely through the TD features of the sEMG signal. The majority of misclassification cases occurred when task 4 was falsely classified as task 2, likely due to the similarity in task completion times for these two tasks. Because fewer trials of task 4 were available than of task 2, the model may have learned the characteristics of task 2 better than those of task 4, thereby misclassifying task 4 as task 2 but not task 2 as task 4. The confusion matrices demonstrate that, despite their similar lengths, different TD features did not perform identically in classifying different tasks, indicating the algorithm classified tasks using a combination of task completion time and each extracted feature. Nevertheless, the applications and limitations of machine learning in the assessment of suturing skill are yet to be determined.
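As a concrete illustration of the padding issue noted above, zero-padding variable-length trials to a common length leaves each trial’s true length, and therefore its completion time, visible to the classifier; the sketch below uses hypothetical trial lengths to show the shape of the padded input.

```python
import numpy as np

def pad_trials(trials: list, max_windows: int) -> np.ndarray:
    """Zero-pad variable-length trials (each of shape (n_windows_i, 8)) to a
    common length. The number of appended zero rows encodes trial duration,
    so completion time can influence classification alongside the TD features."""
    out = np.zeros((len(trials), max_windows, trials[0].shape[1]), dtype=np.float32)
    for i, trial in enumerate(trials):
        out[i, : len(trial)] = trial
    return out

# Two hypothetical trials: a short task (40 windows) and a longer one (150 windows)
padded = pad_trials([np.random.rand(40, 8), np.random.rand(150, 8)], max_windows=200)
print(padded.shape)  # (2, 200, 8)
```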
Interrater reliability and agreement were low for all PAs in this study and were lowest for the GRS, highlighting the potential for expert raters to compromise student evaluations by introducing subjectivity or bias to the application of objective rubrics for surgical skills assessment.4,45,46 While not performed in this study, engaging in moderation may improve rater reliability and should be considered for both low-stakes and high-stakes examinations of surgical skill.3,5,9,12 In the present study, the collection of sEMG data was both feasible and sufficiently robust to successfully train a classification model using machine learning. To determine if artificial intelligence and feature extraction of data may augment and further objectify assessments of suturing skill among veterinary students, studies using more complex tasks and kinematic and EMG data of a larger, more varied cohort of students, residents, and expert surgeons should be performed.7,26,28 Such studies may help determine if the classifiers used here can be trained to define and detect proficiency for different surgical tasks and gestures by differentiating novice, intermediate, and expert surgeons. Extracting frequency and time-frequency domain features and combining features may further improve the model. While AI-based assessments of technical skills may not replace the evaluation of trainees by expert mentors, and additional study is needed, machine learning technologies have potential as teaching and learning aids to improve the accuracy, efficiency, and consistency of assessments in veterinary surgical education.2,27,28
Supplementary Materials
Supplementary materials are posted online at the journal website: avmajournals.avma.org.
Acknowledgments
The authors acknowledge funding from the Jump Applied Research for Community Health through Engineering and Simulation (Jump ARCHES) program for this project.
Results were presented in part at the European College of Veterinary Surgeons Annual Scientific Meeting from July 5 to 7, 2022.
No author has a financial or personal conflict of interest.
The authors also thank Janet Sinn-Hanlon for assistance with figures, Jamie Hart for assistance with data collection, and Sherrie Lanzo for assistance with the design and fabrication of the synthetic tissue model.
References
1. Johnson AL, Greenfield CL, Klippert L, et al. Frequency of procedure and proficiency expected of new veterinary school graduates with regard to small animal surgical procedures in private practice. J Am Vet Med Assoc. 1993;202(7):1068–1071.
2. Simons MC, Hunt JA, Anderson SL. What’s the evidence? A review of current instruction and assessment in veterinary surgical education. Vet Surg. 2022;51(5):731–743. doi:10.1111/vsu.13819
3. Dath D, Regehr G, Birch D, et al. Toward reliable operative assessment: the reliability and feasibility of videotaped assessment of laparoscopic technical skills. Surg Endosc. 2004;18(12):1800–1804. doi:10.1007/s00464-003-8157-2
4. Liao SC, Hunt EA, Chen W. Comparison between inter-rater reliability and inter-rater agreement in performance assessment. Ann Acad Med Singap. 2010;39(8):613–618. doi:10.47102/annals-acadmedsg.V39N8p613
5. Kramp KH, van Det MJ, Veeger NJ, Pierie JP. Validity, reliability and support for implementation of independence-scaled procedural assessment in laparoscopic surgery. Surg Endosc. 2016;30(6):2288–2300. doi:10.1007/s00464-015-4254-2
6. Williamson JA, Farrell R, Skowron C, et al. Evaluation of a method to assess digitally recorded surgical skills of novice veterinary students. Vet Surg. 2018;47(3):378–384. doi:10.1111/vsu.12772
7. Shafiei SB, Durrani M, Jing Z, et al. Surgical hand gesture recognition utilizing electroencephalogram as input to the machine learning and network neuroscience algorithms. Sensors (Basel). 2021;21(5):1733. doi:10.3390/s21051733
8. Oviedo-Peñata CA, Tapia-Araya AE, Lemos JD, et al. Validation of training and acquisition of surgical skills in veterinary laparoscopic surgery: a review. Front Vet Sci. 2020;7:306. doi:10.3389/fvets.2020.00306
9. Watari T, Koyama S, Kato Y, et al. Effect of moderation on rubric criteria for inter-rater reliability in an objective structured clinical examination with real patients. Fujita Med J. 2022;8(3):83–87. doi:10.20407/fmj.2021-010
10. Zehra T, Aaraj S, Naeem H, et al. OSCE grading: a cross-sectional study on inter-observer variability in assessment. Rawal Med J. 2022;47(3):723.
11. Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84(2):273–278. doi:10.1046/j.1365-2168.1997.02502.x
12. Feldman M, Lazzara EH, Vanderbilt AA, DiazGranados D. Rater training to support high-stakes simulation-based assessments. J Contin Educ Health Prof. 2012;32(4):279–286. doi:10.1002/chp.21156
13. Binkley J, Bukoski AD, Doty J, et al. Surgical simulation: markers of proficiency. J Surg Educ. 2019;76(1):234–241. doi:10.1016/j.jsurg.2018.05.018
14. Braksick SA, Wang Y, Hunt SL, et al. Evaluator agreement in medical student assessment across a multi-campus medical school during a standardized patient encounter. Med Sci Educ. 2020;30(1):381–386. doi:10.1007/s40670-020-00916-1
15. Schnabel LV, Maza PS, Williams KM, et al. Use of a formal assessment instrument for evaluation of veterinary student surgical skills. Vet Surg. 2013;42(4):488–496. doi:10.1111/j.1532-950X.2013.12006.x
16. Orovec A, Bishop A, Scott SA, et al. Validation of a Surgical Objective Structured Clinical Examination (S-OSCE) using convergent, divergent, and trainee-based assessments of fidelity. J Surg Educ. 2022;79(4):1000–1008. doi:10.1016/j.jsurg.2022.01.014
17. Harden RM, Gleeson FA. Assessment of clinical competence using an objective structured clinical examination (OSCE). Med Educ. 1979;13(1):41–54.
18. Miller GE. The assessment of clinical skills/competence/performance. Acad Med. 1990;65(9 Suppl):S63–S67. doi:10.1097/00001888-199009000-00045
19. Cunnington JP, Neville AJ, Norman GR. The risks of thoroughness: reliability and validity of global ratings and checklists in an OSCE. Adv Health Sci Educ Theory Pract. 1996;1(3):227–233. doi:10.1007/BF00162920
20. Davis MH, Ponnamperuma GG, McAleer S, Dale VH. The Objective Structured Clinical Examination (OSCE) as a determinant of veterinary clinical skills. J Vet Med Educ. 2006;33(4):578–587. doi:10.3138/jvme.33.4.578
21. Malau-Aduli BS, Mulcahy S, Warnecke E, et al. Inter-rater reliability: comparison of checklist and global scoring for OSCEs. Creat Educ. 2012;3:937–942. doi:10.4236/ce.2012.326142
22. Turner K, Bell M, Bays L, et al. Correlation between Global Rating Scale and specific checklist score for professional behavior of physical therapy students in practical examinations. Educ Res Int. 2014;2014:1–6. doi:10.1155/2014/219512
23. Hatala R, Cook DA, Brydges R, Hawkins R. Constructing a validity argument for the Objective Structured Assessment of Technical Skills (OSATS): a systematic review of validity evidence [published correction appears in Adv Health Sci Educ Theory Pract. 2015;20(5):1177–1178]. Adv Health Sci Educ Theory Pract. 2015;20(5):1149–1175. doi:10.1007/s10459-015-9593-1
24. van Hove PD, Tuijthof GJ, Verdaasdonk EG, Stassen LP, Dankelman J. Objective assessment of technical surgical skills. Br J Surg. 2010;97(7):972–987. doi:10.1002/bjs.7115
25. Kowalewski KF, Garrow CR, Schmidt MW, et al. Sensor-based machine learning for workflow detection and as key to detect expert level in laparoscopic suturing and knot-tying. Surg Endosc. 2019;33(11):3732–3740. doi:10.1007/s00464-019-06667-4
26. Basran PS, Appleby RB. The unmet potential of artificial intelligence in veterinary medicine. Am J Vet Res. 2022;83(5):385–392. doi:10.2460/ajvr.22.03.0038
27. Pérez-Duarte FJ, Sánchez-Margallo FM, Martín-Portugués ID, et al. Ergonomic analysis of muscle activity in the forearm and back muscles during laparoscopic surgery: influence of previous experience and performed task. Surg Laparosc Endosc Percutan Tech. 2013;23(2):203–207. doi:10.1097/SLE.0b013e3182827f30
28. Zappella L, Béjar B, Hager G, Vidal R. Surgical gesture classification from video and kinematic data. Med Image Anal. 2013;17(7):732–745. doi:10.1016/j.media.2013.04.007
29. Wu Z, Li X. A wireless surface EMG acquisition and gesture recognition systems. In: Proceedings of the International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). 2016:1675–1679.
30. Read EK, Vallevand A, Farrell RM. Evaluation of veterinary student surgical skills preparation for ovariohysterectomy using simulators: a pilot study. J Vet Med Educ. 2016;43(2):190–213. doi:10.3138/jvme.0815-138R1
31. Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative “bench station” examination. Am J Surg. 1997;173(3):226–230. doi:10.1016/s0002-9610(97)89597-9
32. Bieck R, Fuchs R, Neumuth T. Surface EMG-based surgical instrument classification for dynamic activity recognition in surgical workflows. Curr Directions Biomed Eng. 2019;5(1):37–40. doi:10.1515/cdbme-2019-0010
33. Chen W, Zhen Z. Hand gesture recognition using sEMG signals based on support vector machine. Paper presented at: IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC); May 24, 2019; Chongqing, China:230–234. doi:10.1109/ITAIC.2019.8785542
34. Atzori M, Gijsberts A, Castellini C, et al. Electromyography data for non-invasive naturally-controlled robotic hand prostheses. Sci Data. 2014;1:140053. doi:10.1038/sdata.2014.53
35. Patel K. A review on feature extraction methods. Int J Adv Res Electr Electron Instrum Eng. 2016;5:823–827. doi:10.15662/IJAREEIE.2016.0502034
36. Sharif H, Eslaminia A, Chembrammel P, Kesavadas T. Classification of activities of daily living based on grasp dynamics obtained from a leap motion controller. Sensors (Basel). 2022;22(21):8273. doi:10.3390/s22218273
37. Gamer M, Lemon J, Singh P. Various coefficients of interrater reliability and agreement. R package version 0.84.1. 2019. https://CRAN.R-project.org/package=irr
38. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria; 2022. https://www.R-project.org
39. Kassambara A. rstatix: pipe-friendly framework for basic statistical tests. R package version 0.7.0. 2021. https://CRAN.R-project.org/package=rstatix
40. Wickham H. ggplot2: elegant graphics for data analysis. Springer-Verlag; 2016. https://ggplot2.tidyverse.org
41. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research [published correction appears in J Chiropr Med. 2017;16(4):346]. J Chiropr Med. 2016;15(2):155–163. doi:10.1016/j.jcm.2016.02.012
42. Tavakol M, Pinner G. Enhancing objective structured clinical examinations through visualisation of checklist scores and global rating scale. Int J Med Educ. 2018;9:132–136. doi:10.5116/ijme.5ad4.509b
43. Aubreville M, Bertram CA, Marzahl C, et al. Deep learning algorithms out-perform veterinary pathologists in detecting the mitotically most active tumor region. Sci Rep. 2020;10(1):16447. doi:10.1038/s41598-020-73246-2
44. Moorthy K, Munz Y, Sarker SK, Darzi A. Objective assessment of technical skills in surgery. BMJ. 2003;327(7422):1032–1037. doi:10.1136/bmj.327.7422.1032
45. Mortsiefer A, Karger A, Rotthoff T, Raski B, Pentzek M. Examiner characteristics and interrater reliability in a communication OSCE. Patient Educ Couns. 2017;100(6):1230–1234. doi:10.1016/j.pec.2017.01.013
46. Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Res Social Adm Pharm. 2013;9(3):330–338. doi:10.1016/j.sapharm.2012.04.004
47. Beilock SL, Bertenthal BI, McCoy AM, Carr TH. Haste does not always make waste: expertise, direction of attention, and speed versus accuracy in performing sensorimotor skills. Psychon Bull Rev. 2004;11(2):373–379. doi:10.3758/bf03196585
48. Lee AR, Cho Y, Jin S, Kim N. Enhancement of surgical hand gesture recognition using a capsule network for a contactless interface in the operating room. Comput Methods Programs Biomed. 2020;190:105385. doi:10.1016/j.cmpb.2020.105385
49. Kanstrén T. A look at precision, recall, and F1-score: exploring the relations between machine learning metrics. Towards Data Science. 2020. https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec
50. Boissady E, de La Comble A, Zhu X, Hespel AM. Artificial intelligence evaluating primary thoracic lesions has an overall lower error rate compared to veterinarians or veterinarians in conjunction with the artificial intelligence. Vet Radiol Ultrasound. 2020;61(6):619–627. doi:10.1111/vru.12912
51. Vinicki K, Pierluigi F, Maja B, Romana T. Using convolutional neural networks for determining reticulocyte percentage in cats. arXiv. 2018;1803.04873.