47.08 Readability and Quality of FAQs about Adrenal Disorders: Comparing ChatGPT 4.0 to Content Experts

J.A. Kasmirski1, A.A. Harsono1, C. Wu1, Z. Song1, A. Gillis1, J. Fazendin1, H. Chen1, B. Lindeman1. 1University of Alabama at Birmingham, Department of Surgery, Birmingham, Alabama, USA

Introduction: Patients with adrenal disorders may have difficulty accessing a high-volume specialist, which can negatively affect care, particularly for patients with low health literacy. Artificial intelligence (AI) is increasingly used to meet this need; however, it is unclear whether AI adequately addresses patients' concerns. This study aimed to compare the readability and quality of responses to frequently asked questions (FAQs) about adrenal disorders on the American Association of Endocrine Surgeons (AAES) website with those generated by ChatGPT 4.0.


Methods: FAQs about the adrenal gland on the AAES website were used as prompts to generate ChatGPT 4.0 responses. The readability of the answers to 14 FAQ-section questions from the AAES website and from ChatGPT 4.0 was assessed with several instruments: the Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFI), Automated Readability Index (ARI), and Coleman-Liau Index (CLI), for all of which a lower score indicates better readability. Both sets of responses were evaluated by three independent raters using the applicable criteria of the DISCERN instrument, a tool designed to judge the quality of consumer health information, on which a higher score is favorable. The intraclass correlation coefficient (ICC) was used to calculate inter-rater reliability between raters. Medians and interquartile ranges of the readability scores were calculated and compared using the Mann-Whitney U test.
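
As a concrete illustration of this analysis, the sketch below scores answer texts with the five readability indices and compares the two groups with the Mann-Whitney U test. It is a minimal sketch, not the authors' actual code: it assumes the open-source Python packages textstat and scipy (the abstract does not specify which software was used), and the aaes_answers and chatgpt_answers lists are hypothetical placeholders for the 14 paired FAQ answers.

```python
# Minimal sketch of the readability comparison described above.
# Assumes the open-source `textstat` and `scipy` packages; the answer
# texts below are placeholders for the 14 paired FAQ answers.
import textstat
from scipy.stats import mannwhitneyu

# The five readability indices; lower scores mean better readability.
METRICS = {
    "FKGL": textstat.flesch_kincaid_grade,
    "SMOG": textstat.smog_index,
    "GFI": textstat.gunning_fog,
    "ARI": textstat.automated_readability_index,
    "CLI": textstat.coleman_liau_index,
}

aaes_answers = ["The adrenal glands sit above the kidneys."]     # 14 AAES answers
chatgpt_answers = ["The adrenal glands are paired organs."]      # 14 ChatGPT answers

for name, score_fn in METRICS.items():
    aaes_scores = [score_fn(text) for text in aaes_answers]
    gpt_scores = [score_fn(text) for text in chatgpt_answers]
    # Unpaired nonparametric comparison of the two score distributions.
    stat, p = mannwhitneyu(aaes_scores, gpt_scores, alternative="two-sided")
    print(f"{name}: U={stat:.1f}, p={p:.3f}")
```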


Results: The responses from both the AAES website and ChatGPT were overall “difficult to read” and above a high school reading level. However, on most readability parameters the AAES website scored better on average than ChatGPT [median FKGL 12.4 (11.0-14.6) vs 15.7 (11.8-19.5), SMOG 13.5 (13.0-14.6) vs 16.5 (14.0-19.4), ARI 11.8 (9.4-13.3) vs 15.2 (11.7-22.6), GFI 14.0 (12.8-16.0) vs 17.6 (14.0-23.8), and CLI 12.6 (11.5-13.7) vs 15.6 (14.2-19.2), respectively; p<0.05 for all except FKGL, p=0.06]. The average DISCERN score per question was 2 (1-5) for AAES answers and 1 (1-3) for ChatGPT 4.0-generated answers (p<0.001). Inter-rater agreement was very high (ICC = 0.94, 95% CI 0.91-0.96).
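
For the reliability figure, the sketch below shows one way an intraclass correlation coefficient could be computed, assuming the open-source pingouin package; the abstract does not state which ICC form or software was used, and the long-format ratings table and its values are hypothetical.

```python
# Hedged sketch of the inter-rater reliability check, assuming the
# open-source `pingouin` package. The data below are illustrative only.
import pandas as pd
import pingouin as pg

# Hypothetical long format: one row per (question, rater) pair holding
# that rater's DISCERN score for the answer to that question.
ratings = pd.DataFrame({
    "question": sorted([1, 2, 3, 4, 5] * 3),
    "rater": ["A", "B", "C"] * 5,
    "score": [2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 1, 1, 2, 3, 2],
})

icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # one row per ICC form, with 95% CI
```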


Conclusion: Expert-generated content on the AAES FAQ website scored better on average than ChatGPT 4.0-generated responses on both readability metrics and the DISCERN instrument criteria. ChatGPT 4.0 will need to generate more accessible and thorough answers to questions about adrenal disorders before it can serve as a reliable source of information for patients.