60.07 Comparative LLM Analysis: Evaluating GPT-4o and Claude 3.5 Sonnet for First-Aid Scenario Guidance

N.K. West Jr.², A.S. Edwards³, H.S. Lucas⁵, J.S. O’Brien⁴, J.S. Sims¹, J.S. Upperman¹ ¹Vanderbilt University Medical Center, Pediatric Surgery, Nashville, TN, USA ²Yale University, New Haven, CT, USA ³Tennessee State University, Nashville, TN, USA ⁴Columbia University College of Physicians and Surgeons, Public Health, New York, NY, USA ⁵Brandeis University, Boston, MA, USA

Introduction: Large Language Models (LLMs) have the potential to provide immediate first-aid advice, especially in regions with limited access to professional medical services. In mass casualty events, where professional help may be delayed, bystander first aid is critical. However, the efficacy and safety of LLMs in guiding such interventions remain unexamined. We hypothesized that LLM performance would vary across common first-aid scenarios. To test this hypothesis, we evaluated the performance of two LLMs, GPT-4o and Claude 3.5 Sonnet, using five first-aid case vignettes.

Methods: Five diverse first-aid case vignettes were created using LLMs and vetted by American Red Cross experts for realism and applicability. We then evaluated the performance of GPT-4o and Claude 3.5 Sonnet on these vignettes using a standard LLM accuracy metric. Each model was tested three times per vignette, yielding 30 responses in total (2 models × 5 vignettes × 3 runs). We assessed the quality of responses on a Likert scale of 0 to 2 (0 = incorrect, 1 = partially correct, 2 = fully correct) across six criteria: diagnostic accuracy, first-aid advice accuracy, triage accuracy, safety, comprehensiveness, and consistency. Each criterion was scored separately. Additionally, a cross-evaluation was conducted in which each LLM assessed the other's performance across all 30 interrogation responses.
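The scoring design described in the Methods can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' analysis code: the function names `response_grid` and `mean_scores`, the layout of the ratings dictionary, and the use of a simple mean as the summary statistic are all assumptions, and no study data are reproduced.

```python
from itertools import product
from statistics import mean

# Design taken from the Methods; everything else below is hypothetical.
MODELS = ("GPT-4o", "Claude 3.5 Sonnet")
VIGNETTES = range(1, 6)  # five case vignettes
RUNS = range(1, 4)       # three runs per vignette
CRITERIA = (
    "diagnostic accuracy",
    "first-aid advice accuracy",
    "triage accuracy",
    "safety",
    "comprehensiveness",
    "consistency",
)


def response_grid():
    """Enumerate every (model, vignette, run) response in the design."""
    return list(product(MODELS, VIGNETTES, RUNS))


def mean_scores(ratings):
    """Average 0-2 ratings per (model, criterion).

    `ratings` maps (model, vignette, run, criterion) -> score in {0, 1, 2}.
    Averaging is an assumed summary; the abstract does not state one.
    """
    buckets = {}
    for (model, _vignette, _run, criterion), score in ratings.items():
        if score not in (0, 1, 2):
            raise ValueError(f"score must be 0, 1, or 2, got {score!r}")
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion!r}")
        buckets.setdefault((model, criterion), []).append(score)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    # 2 models x 5 vignettes x 3 runs
    print(len(response_grid()))
```

A rater's sheet would populate `ratings` with one entry per response and criterion; `mean_scores` then gives one value per model and criterion, which is one simple way to compare the two models side by side.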

Results: Claude 3.5 Sonnet outperformed GPT-4o in first-aid advice accuracy, comprehensiveness, and consistency, while both models performed equally well in diagnostic accuracy, triage accuracy, and safety. Both models consistently identified the primary condition or emergency within each vignette and appropriately recommended seeking professional medical help. First-aid recommendations aligned with the American Red Cross' most recent guidelines, though some key first-aid steps were omitted in a few interrogations. Cross-evaluation yielded consistent ratings of "fully correct and comprehensive" for all six criteria, further validating the high quality of the generated first-aid guidance.

Conclusion: This study provides initial evidence of the potential utility of advanced LLMs in guiding first-aid interventions, with Claude 3.5 Sonnet demonstrating superior performance in key areas. These findings suggest LLMs could serve as valuable supplementary tools in emergencies. However, further research is necessary to validate these results and address potential limitations before considering widespread implementation.
