12.18 Automating Content Analysis and Improving Readability of Plastic Surgery Websites with ChatGPT

J. E. Fanning1, M. J. Escobar-Domingo1, J. Foppiani1, D. Lee1, A. S. Miller1, S. J. Lin1, B. T. Lee1; 1Beth Israel Deaconess Medical Center, Plastic Surgery, Boston, MA, USA

 

Introduction:

The quality and readability of online information on aesthetic procedures are often suboptimal, limiting its accessibility to broad patient audiences. Numerous studies have conducted content and readability analyses of such material, but traditional manual methods are time-consuming and prone to human error. This study compares manual and ChatGPT-automated content analysis of private-practice plastic surgery webpages and evaluates ChatGPT's ability to improve their readability.

Methods:

Google searches were performed via Startpage.com to obtain depersonalized results, and the first 70 results were collected for each of the queries “breast implant size factors” and “breast implant size decision.” Advertisements, unrelated webpages, non-US practices, and non-practice webpages were excluded. Two authors (J.E.F. and M.J.E.) manually analyzed webpage texts for the presence of twelve patient decision-making factors in breast implant size selection. ChatGPT 3.5 and 4.0 were used with two prompt levels (1: general instructions; 2: specific instructions) to automate content analysis and to rewrite webpage texts with improved readability. Using the human-rated content scoring as the benchmark, ChatGPT content analysis outputs were classified as hallucinations (false positives), accurate (true positives/negatives), or omissions (false negatives). Six readability metric scores for original and revised webpage texts were obtained with Readability Studio software. Friedman’s two-way analysis of variance by ranks was used to compare baseline readability scores with those of the ChatGPT-generated webpage texts.
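As an illustrative sketch only, not the study's actual code, the classification scheme and the Friedman comparison described above could be implemented in Python roughly as follows; the factor names, scores, and function names are placeholders:

```python
# Hypothetical sketch of the benchmark-based classification and the Friedman test.
# Factor names, readability scores, and function names are illustrative assumptions.
from scipy.stats import friedmanchisquare

def classify(gpt_flags: dict, human_flags: dict) -> dict:
    """Label each factor's ChatGPT rating against the human-rated benchmark."""
    labels = {}
    for factor, human in human_flags.items():
        gpt = gpt_flags.get(factor, False)
        if gpt and not human:
            labels[factor] = "hallucination"  # false positive
        elif not gpt and human:
            labels[factor] = "omission"       # false negative
        else:
            labels[factor] = "accurate"       # true positive or true negative
    return labels

# Example: readability scores (e.g., grade level) per webpage under each condition.
baseline   = [14.2, 12.9, 15.1, 13.4]
gpt35_spec = [9.8, 10.1, 11.0, 9.5]
gpt40_spec = [10.2, 9.9, 11.3, 9.7]

# Friedman's two-way ANOVA by ranks across the repeated-measures conditions.
stat, p = friedmanchisquare(baseline, gpt35_spec, gpt40_spec)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```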

Results:

Seventy-four US plastic surgery practice webpages on breast implant size selection were included. Significant improvements from baseline were observed in all six readability metrics when using the specific-instructions prompt with ChatGPT 3.5 (all p < .05). ChatGPT 4.0 outputs showed no consistent readability improvements over ChatGPT 3.5 with the specific-instructions prompt (p > .05). Rates of hallucination, accuracy, and omission in ChatGPT content scoring varied widely across decision-making factors. With ChatGPT 4.0, accuracy and hallucination rates increased while omission rates decreased relative to ChatGPT 3.5.

Conclusion:

Leveraging ChatGPT for automated content analysis and readability improvement offers an innovative approach to enhancing the quality of online medical information and expanding the capabilities of online health research. Content analysis is limited by ChatGPT 3.5’s high omission rates and ChatGPT 4.0’s high hallucination rates. Our results also underscore the importance of iterative prompt design in optimizing ChatGPT performance on research tasks. Further studies are required to validate ChatGPT’s potential for broader adoption in diverse medical research applications.