R.D. Zeh1, K. Lee1, T. Wilson1 1University of Pittsburgh Medical Center, Dept. of Surgery, Pittsburgh, PA, USA
Introduction:
Large Language Models (LLMs) are evolving to play a more impactful role in surgical education. The most popular publicly available LLM, ChatGPT, has been studied in various contexts related to surgical education. These studies include, but are not limited to, its performance on the ABSITE exam, its ability to risk-stratify patients perioperatively, its decision-making in common surgical and breast cancer scenarios, and its ability to answer potential patient questions following thyroid surgery. While the results and conclusions of these studies vary based on the outcome measured, ChatGPT generally performs well. Given its potential utility as an educational tool in several surgical domains, we sought to determine how well it critiques detailed descriptions of operations in the setting of the surgical oral boards.
Methods:
A prompt was engineered instructing ChatGPT to act as a general surgery oral boards examiner tasked with critiquing descriptions of 3 operations: laparoscopic cholecystectomy, radiocephalic arteriovenous fistula creation, and laparoscopic right colectomy. Deliberate errors were introduced into the operative descriptions in 3 domains: anatomical, technical, and equipment-related. Each scenario was presented to ChatGPT in 10 iterations per error domain for each of the 3 operations. The frequency with which ChatGPT corrected the user on these deliberate errors was recorded.
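The evaluation loop described above can be sketched in code. This is a hypothetical illustration, not the authors' actual workflow: the function `examiner_detects_error` is a stand-in for a real query to the ChatGPT examiner, stubbed here (using the sensitivity figures reported in Results) so the sketch runs without an API call; all identifiers are assumptions.

```python
# Hypothetical sketch of the Methods protocol: 3 operations x 3 error
# domains x 10 iterations, tallying how often the LLM examiner flags
# the deliberately inserted error.
from collections import defaultdict

OPERATIONS = [
    "laparoscopic cholecystectomy",
    "radiocephalic AV fistula creation",
    "laparoscopic right colectomy",
]
ERROR_DOMAINS = ["anatomical", "technical", "equipment-related"]
ITERATIONS = 10

def examiner_detects_error(operation: str, domain: str, trial: int) -> bool:
    """Placeholder for a call to the LLM examiner; returns True when the
    model flags the inserted error. Stubbed to mimic the reported 60%
    sensitivity for the technical (critical view of safety) error in
    laparoscopic cholecystectomy, and 100% elsewhere."""
    if operation == OPERATIONS[0] and domain == "technical":
        return trial < 6  # 6 of 10 trials detected
    return True

def sensitivity_table() -> dict:
    """Run every scenario ITERATIONS times and compute per-cell sensitivity."""
    hits = defaultdict(int)
    for op in OPERATIONS:
        for domain in ERROR_DOMAINS:
            for trial in range(ITERATIONS):
                if examiner_detects_error(op, domain, trial):
                    hits[(op, domain)] += 1
    return {key: count / ITERATIONS for key, count in hits.items()}
```

In a real implementation, `examiner_detects_error` would send the examiner prompt plus the error-seeded operative description to the model and parse whether the response identifies the error.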
Results:
ChatGPT fell short of 100% sensitivity in only one domain: identification of the critical view of safety prior to clipping any arterial or ductal structures during laparoscopic cholecystectomy, which it detected with 60% sensitivity. It caught all inserted errors in the descriptions of AV fistula creation and right hemicolectomy, as well as all equipment-related and anatomy-related errors. Furthermore, it clearly articulated these errors back to the user with explanations.
Conclusion:
The impact of LLMs, specifically ChatGPT, on clinical and surgical education continues to evolve, improve, and expand. This study demonstrates that GPT-4o is a potentially viable resource for feedback when practicing operative descriptions for the oral boards. Future directions include further performance validation across diverse specialties and operations and across other types of errors, the creation of custom GPTs for specific common scenarios, and ultimately a comprehensive LLM-powered tool for oral board preparation.