M. Shah¹, T. Sathe¹, C. Silvestri¹, N. B. Verhagen², N. Wolfrath², S. Krishnamoorthy¹, A. Kothari²
¹Columbia University College of Physicians and Surgeons, Department of Surgery, New York, NY, USA
²Medical College of Wisconsin, Milwaukee, WI, USA
Introduction:
The recent rise of large language models (LLMs) has expanded opportunities to incorporate AI across many sectors; however, adoption in medicine remains cautious. Transformer-based LLMs such as OpenAI's ChatGPT can recognize context and retrieve knowledge accurately, as demonstrated by their ability to pass medical board examinations. We sought to determine whether an LLM could accurately simulate a more complex and dynamic examination format, the American Board of Surgery Certifying Exam (ABS CE).
Methods:
We built a custom chatbot using OpenAI's GPT-4 LLM. The chatbot was designed to simulate the experience of taking the ABS CE by guiding users through cases in the style of surgical oral boards. The chatbot was instructed to open with a case introduction and to answer questions on history, physical exam, labs, imaging, and studies when requested. It then asked users to walk through the appropriate preoperative setup, steps of the operation, and postoperative care. At the end of each case, the chatbot generated a multiple-choice question synthesizing the themes of the case and then solicited user feedback. We asked surgical residents and faculty to try the chatbot and provide early feedback to guide iterative design. The primary objective of this analysis was to assess user engagement and feedback on the educational instrument.
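The abstract does not specify implementation details. As a minimal illustrative sketch only, a prompt-driven simulator of this kind could be structured as follows, assuming the OpenAI Python client; the system prompt wording and conversation loop are hypothetical, not the authors' code.

```python
# Hypothetical sketch of a prompt-driven oral-boards simulator on the OpenAI
# chat completions API. The system prompt and loop structure are assumptions,
# written to mirror the examiner behavior described in Methods.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are an American Board of Surgery Certifying Exam examiner.
Begin each session by introducing a surgical case. Reveal history, physical
exam, labs, imaging, and studies only when the examinee asks. Then ask the
examinee to walk through preoperative setup, the steps of the operation, and
postoperative care. Close with one multiple-choice question synthesizing the
case, then ask the examinee for feedback on the session."""

def run_simulation() -> None:
    """Interactive loop: alternate examinee and examiner turns until the user quits."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": "Begin a new case."}]
    while True:
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        text = reply.choices[0].message.content
        print(f"\nExaminer: {text}")
        messages.append({"role": "assistant", "content": text})
        user_turn = input("\nYou (blank to end): ").strip()
        if not user_turn:
            break
        messages.append({"role": "user", "content": user_turn})

if __name__ == "__main__":
    run_simulation()
```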
Results:
Here, we present a novel proof-of-concept of an LLM-powered chatbot simulating the ABS CE experience. In a pilot demonstration, 22 simulations were completed by eight users. Conversations averaged 10.4 exchanges (range, 7 to 16), counting both chatbot and user messages. User feedback centered on three areas: (1) improving the user experience, (2) adding new features such as topic and difficulty selection, and (3) ensuring accuracy of responses.
Conclusion:
Our initial experience demonstrates that a GPT-4-powered chatbot for board preparation is a viable application of LLM technology in surgical education. User feedback thus far has been favorable, suggesting an appetite for more dynamic preparation tools that present scenarios in the style of the oral boards. Future directions include validating the accuracy and appropriateness of the chatbot's prompts and responses through expert review of conversation transcripts.