113.07 Evaluating ChatGPT’s Performance on a Situational Judgment Test for Non-Cognitive Skills Assessment

A. G. Atkinson1, D. L. Dent1, M. Kitano1, J. W. Kempenich1; 1University of Texas Health Science Center at San Antonio, San Antonio, TX, USA

Introduction: Situational Judgment Tests (SJTs) are used to evaluate non-cognitive attributes related to professionalism. SJTs are increasingly used in medical school and residency selection as part of an applicant's holistic review. In the surgical literature, there is evidence that using SJTs to evaluate residency candidates decreases attrition and increases diversity. A recent study found that ChatGPT, an advanced large language model, performed well on the SJT required of medical students in the United Kingdom. This study aims to evaluate the performance of ChatGPT on the SJT developed by the Association of American Medical Colleges (AAMC), which is used to assess eight core competencies: service orientation, social skills, cultural competence, teamwork, ethical responsibility to self and others, reliability and dependability, resilience and adaptability, and capacity for improvement.

Methods: All item-response questions (n=186) associated with the 30 scenarios provided in the AAMC practice SJT were individually entered into ChatGPT as prompts. Each item was a potential action that could be taken in response to the associated scenario, and each prompt included instructions to rate the effectiveness of the given action on a four-point scale (very ineffective to very effective). Each ChatGPT output also included a rationale for the generated effectiveness rating. ChatGPT ratings were generated three times per item to evaluate response variability. The median rating was compared to the corresponding item answer on the AAMC scoring key, which had been determined by subject matter experts (SMEs). Full credit was given to outputs that exactly matched the scoring key, and partial credit was given to outputs close to the SMEs' rating. The ChatGPT-generated and scoring-key rationales for each question were also compared.
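
The scoring logic can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' code: the item names, the 1-4 integer coding of the effectiveness scale, and the "within one point" partial-credit rule are assumptions, since the abstract states only that partial credit was given to outputs close to the key.

```python
"""Minimal sketch of the per-item scoring procedure described above.

Assumptions (not specified in the abstract): ratings are coded as
integers 1-4 (1 = very ineffective, 4 = very effective), and partial
credit means the median ChatGPT rating falls within one point of the
AAMC scoring-key rating. The example data are illustrative only.
"""
from statistics import median


def score_item(chatgpt_ratings: list[int], key_rating: int) -> str:
    """Compare the median of three ChatGPT ratings to the scoring key."""
    med = median(chatgpt_ratings)
    if med == key_rating:
        return "full"                      # exact match with the key
    if abs(med - key_rating) <= 1:         # assumed meaning of "close"
        return "partial"
    return "incorrect"


# Hypothetical items: (three ChatGPT ratings, scoring-key rating).
items = {
    "scenario1_itemA": ([3, 3, 4], 3),
    "scenario1_itemB": ([2, 2, 2], 1),
    "scenario2_itemA": ([4, 1, 4], 1),
}

for name, (ratings, key) in items.items():
    print(name, score_item(ratings, key))
```
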

Results: The three ratings generated per question showed substantial agreement (κ=0.838, p<0.001). ChatGPT answered 54.3% (101/186) of items completely correctly, 40.3% (75/186) partially correctly, and 5.4% (10/186) incorrectly. Six of the ten incorrect answers were directly related to the teamwork competency.
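
As a rough illustration of the reported statistics: the abstract does not specify which kappa statistic was computed across the three repeated ratings, so the sketch below uses Fleiss' kappa from statsmodels as one plausible choice on made-up ratings, and then reproduces the percentage arithmetic for the 186 scored items.

```python
"""Sketch of the agreement and accuracy calculations in the Results.

Assumptions: the choice of Fleiss' kappa and the ratings matrix below
are illustrative; only the counts 101/75/10 out of 186 come from the
abstract.
"""
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per item, one column per repeated ChatGPT run (1-4 ratings).
ratings = np.array([
    [3, 3, 3],
    [2, 2, 3],
    [4, 4, 4],
    [1, 1, 2],
])

table, _ = aggregate_raters(ratings)       # item x category count table
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")

# Accuracy breakdown reported in the abstract (186 scored items).
for label, n in [("complete", 101), ("partial", 75), ("incorrect", 10)]:
    print(f"{label:>10}: {n}/186 = {n / 186:.1%}")
```
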

Conclusion: ChatGPT performed well on a test measuring non-cognitive skills required for medical professionalism. Nearly 95% of its outputs matched or partially matched the ratings provided by medical educators. Rating discrepancies may reflect bias in ChatGPT's training data, disagreement on competency priorities, or a deficit in discerning higher-level sociolinguistic nuance. Large language models have potential applications in interactive ethics simulations; in the selection process, this could allow programs to evaluate candidates' communication and professional skills in a more dynamic manner than current SJTs offer. However, future studies are required to investigate and mitigate the underlying bias in this emerging technology.