N. B. Verhagen¹, N. Wolfrath¹, T. Sathe², M. Shah², K. Nimmer¹, C. Tomlin³, J. Peschman¹, P. Murphy¹, S. Dream¹, A. Kothari¹
¹Medical College of Wisconsin, Department of Surgery, Milwaukee, WI, USA
²University of California – San Francisco, Department of Surgery, San Francisco, CA, USA
³Independent Software Engineer, Milwaukee, WI, USA
Introduction: ChatClinic is a medical education platform that employs large language models (LLMs) to create AI-Simulated Clinical Interactions (ASCI). Users of ChatClinic interact with virtual patients in real time to gather medical histories, order diagnostic tests, and formulate treatment plans before receiving adaptive feedback on the interaction. This study aimed to assess the quality of ASCIs delivered by ChatClinic, using both LLM and human expert review.
Methods: ChatClinic was used to simulate three surgical ASCIs: acute cholecystitis, appendicitis, and diverticulitis. Each simulation was conducted using a standardized clinical dialogue to ensure consistent interactions. Five virtual patients were generated for each condition, yielding 15 unique interactions. ASCI transcripts were then evaluated by three LLMs (GPT-4, Claude 2, PaLM-2) and five human domain experts (faculty surgeons) across four domains, each rated on a 5-point scale (5 = strongly agree): factual accuracy, comprehension, reasoning, and minimal potential for harm or bias.
Results: The mean composite score across all evaluators was 4.3 (SD=0.5). Mean domain scores were 3.92 (SD=0.81) for factual accuracy, 4.7 (SD=0.5) for comprehension, 4.3 (SD=0.7) for reasoning, and 4.3 (SD=0.8) for minimal potential for harm or bias. LLMs assigned a higher composite score than human experts (4.6 vs. 4.2, P<.001). While LLM and human expert scores were similar for comprehension (4.7 vs. 4.7) and reasoning (4.4 vs. 4.3), LLMs assigned higher scores than human experts for factual accuracy (4.5 vs. 3.5, P<.001) and minimal potential for harm or bias (4.9 vs. 3.9, P<.001). By condition, appendicitis was rated highest (4.4, SD=0.4), followed by diverticulitis (4.3, SD=0.4) and cholecystitis (4.1, SD=0.3). Among LLM evaluators, diverticulitis received the highest score (4.7) and cholecystitis the lowest (4.6); human experts rated appendicitis highest (4.3) and cholecystitis lowest (3.7).
Conclusion: ChatClinic produces high-quality simulated patient interactions for commonly encountered acute surgical conditions, with strong comprehension and reasoning. Evaluation scores varied between LLM and human expert evaluators, with LLMs assigning higher scores, most notably in the domain of factual accuracy. Careful assessment and refinement of AI-based educational tools, with human involvement, are crucial prior to their broad deployment and use.