61.05 Simulating Goals of Care Discussions: Leveraging Language Models for Communication Skills Training

S. Menon1,2, G. Mertz1,2, G.A. Del Carmen1,2, G.A. Del Carmen1  1Albany Medical Center, Department of Surgery, Albany, NY, USA; 2Albany Medical College, Albany, NY, USA

Introduction:

Goals of care (GOC) discussions are essential in the management of patients with surgically unresectable malignancies, yet these conversations are often immensely challenging for providers. Large language models (LLMs), such as OpenAI’s GPT models, offer a promising adjunct to traditional simulation training for surgical residents by providing dynamic, natural text generation that can convey reasoning. Leveraging agentic artificial intelligence (AI), which enables LLMs to assume specific roles and exhibit goal-oriented behaviors, we sought to evaluate the viability of simulating GOC discussions with this technology.


Methods:

Using Microsoft’s AutoGen, an agent-based framework, we enabled LLMs to conduct self-driven GOC dialogues, with OpenAI’s GPT-4 model powering both conversation participants. The Physician agent assumed the role of a surgical oncologist tasked with discussing treatment options, and the Patient agent simulated one of five plausible emotional responses: anger, sadness, acceptance, denial, or anxiety. We conducted ten simulations per emotional condition, recording each interaction and analyzing its lexical properties, including message length (measured in characters) and question frequency. Analysis of variance (ANOVA) was used to determine whether communicative behaviors varied significantly across conditions and agents.
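
For illustration, a minimal sketch of this two-agent setup is shown below, assuming the pyautogen ConversableAgent API; the system messages, opening line, and turn limit are our own assumptions rather than the study's exact prompts or configuration.

```python
# A minimal sketch of the two-agent GOC simulation described above, using
# Microsoft's AutoGen (pyautogen). System messages, the opening line, and
# max_turns are illustrative assumptions, not the study's exact settings.
from autogen import ConversableAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_OPENAI_API_KEY"}]}

emotion = "denial"  # one of: anger, sadness, acceptance, denial, anxiety

physician = ConversableAgent(
    name="Physician",
    system_message=(
        "You are a surgical oncologist leading a goals of care discussion "
        "with a patient whose malignancy is surgically unresectable. "
        "Discuss treatment options with empathy."
    ),
    llm_config=llm_config,
    human_input_mode="NEVER",
)

patient = ConversableAgent(
    name="Patient",
    system_message=(
        f"You are a patient with a surgically unresectable malignancy. "
        f"Respond to your physician in a state of {emotion}."
    ),
    llm_config=llm_config,
    human_input_mode="NEVER",
)

# Run one self-driven dialogue; the transcript is kept in result.chat_history.
result = physician.initiate_chat(
    patient,
    message="I'd like to talk with you about the goals of your care.",
    max_turns=10,
)
```

Each simulated transcript can then be read back from result.chat_history for downstream lexical analysis.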


Results:

Our analysis identified significant variation in the Physician agent’s questioning behavior across emotional conditions, with the highest question frequency during the denial state and the lowest during the anger state (2.36 vs 1.10, P=0.006). Physician responses, measured in character count, were significantly longer than those of the Patient (756 vs 506 characters, P<0.001). On average, the Patient asked significantly more questions than the Physician (57.0 vs 1.93, P<0.001). The “angry” Patient asked the most questions while the “accepting” Patient asked the fewest (7.40 vs 3.10), though this difference only trended toward significance (P=0.07).
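
As a hedged sketch of how metrics like these could be derived, the snippet below counts characters and question marks per message and runs a one-way ANOVA with scipy; the data here are random placeholders, not the study's recorded values.

```python
# A sketch of the lexical analysis and one-way ANOVA, assuming scipy.
# The question counts below are random placeholders, not recorded data.
import random
from scipy.stats import f_oneway

def message_stats(text: str) -> tuple[int, int]:
    """Return (character count, question count) for one message."""
    return len(text), text.count("?")

random.seed(0)
conditions = ("anger", "sadness", "acceptance", "denial", "anxiety")
# Ten simulations per emotional condition, mirroring the study design.
physician_questions = {
    cond: [random.randint(0, 4) for _ in range(10)] for cond in conditions
}

# Test whether question frequency varies across the five conditions.
f_stat, p_value = f_oneway(*physician_questions.values())
print(f"F = {f_stat:.2f}, P = {p_value:.3f}")
```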


Conclusion:

We demonstrate the feasibility of LLMs as a pedagogical tool to simulate and analyze complex, emotionally charged medical dialogues for surgical residents. Further, our findings offer novel insight into the intersection of surgical practice and advanced AI, laying a foundation for understanding how machine learning systems can model, integrate into, and interpret sensitive clinical interactions.