01.01 Assessing the Performance of Artificial Intelligence on Pediatric Surgery Fellowship-Level Questions

R. Sachs1, P. Truche1, J. Lee1, S. Burjonrappa2, A. Thenappan2  1Robert Wood Johnson – UMDNJ, General Surgery, New Brunswick, NJ, USA 2Robert Wood Johnson – UMDNJ, Pediatric Surgery, New Brunswick, NJ, USA

Introduction:  Recent advancements in artificial intelligence (AI) have led to the development of systems capable of passing the USMLE examinations, suggesting a future role for AI in medical education and clinical practice. That role, however, has yet to be defined, and its limitations are still being explored. This study aims to evaluate the accuracy of AI systems in answering pediatric surgery fellowship-level questions.

Methods:  We assessed the performance of the ChatGPT-4 large language model on pediatric surgery fellowship-level questions. No additional training or input was provided prior to assessment. A total of 419 multiple-choice pediatric surgery fellowship-level questions were obtained from the ACS SCORE curriculum and categorized as operative/procedural decisions, diagnostic decision-making, pharmacology, embryology/anatomy, pathophysiology, or epidemiology/statistics. The model's accuracy in identifying the correct multiple-choice answer was calculated overall and compared across categories.
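The overall and per-category accuracy calculation described above can be sketched as follows. This is a minimal illustration only: the records, answer letters, and function names are hypothetical and do not come from the SCORE item bank or the study's actual analysis.

```python
from collections import defaultdict

# Hypothetical graded records: (category, model_answer, correct_answer).
# Categories mirror those named in the Methods; the answers are invented.
results = [
    ("Pharmacology", "B", "B"),
    ("Pharmacology", "C", "A"),
    ("Diagnostic decision-making", "D", "D"),
    ("Diagnostic decision-making", "A", "C"),
    ("Operative/procedural decisions", "B", "B"),
]

def accuracy_by_category(records):
    """Return the fraction of correctly answered questions per category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, given, answer in records:
        total[category] += 1
        if given == answer:
            correct[category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Overall accuracy: correct answers divided by total questions.
overall = sum(given == answer for _, given, answer in results) / len(results)
per_category = accuracy_by_category(results)
```

With the toy records above, `overall` is 0.6 and `per_category` holds one fraction per category, paralleling the study's overall figure (220/419) and category-level comparisons.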

Results: ChatGPT-4 correctly answered 52.5% (220/419) of the pediatric surgery fellowship-level questions overall. Accuracy was highest in the pharmacology category (65.0%), followed by anatomy/embryology (57.5%) and pathophysiology (53.5%). In contrast, the model exhibited lower performance in the diagnostic decision-making (51.9%) and operative/procedural decisions (51.1%) categories.

Conclusion: ChatGPT-4 performed only moderately in answering pediatric surgery fellowship-level questions, and its accuracy was inconsistent across question categories. Notably, it was least successful in the categories that primarily require decision-making. These results highlight the limitations of AI in replicating the expertise and critical thinking of pediatric surgery trainees and surgeons. As such, caution and further investigation are necessary before AI systems are integrated into medical education and clinical practice.