S. H. Qazi1, A. Shafiq2, S. Ghazanfar2, Z. H. Hoodbhoy3, S. D. Goldstein4, C. A. Colunga Tinajero5, S. Islam1, J. K. Das2 1Aga Khan University, Surgery, Karachi, Sindh, Pakistan; 2Aga Khan University, Institute of Global Health & Development, Karachi, Sindh, Pakistan; 3Aga Khan University, Pediatrics, Karachi, Sindh, Pakistan; 4Northwestern University Feinberg School of Medicine, Surgery, Chicago, Illinois, USA; 5Antiguo Hospital Civil de Guadalajara, Surgery, Guadalajara, Jalisco, Mexico
Introduction:
The advancement of artificial intelligence (AI), especially large language models (LLMs) such as ChatGPT, has sparked debate about their influence on research. LLMs have the potential to transform academic writing, including systematic reviews, and to aid in literature searches. It is therefore important to examine how AI and LLMs might benefit researchers in academic writing, using systematic reviews as a case study.
The aim of this study was to employ ChatGPT to generate systematic reviews and to compare their elements with those of human-written systematic reviews in the field of pediatric gastrointestinal (GI) surgery.
Methods:
Thirty-five systematic reviews published between January 2021 and April 2023 were retrieved from three pediatric surgery journals and assessed; 19 that met the eligibility criteria were included. GPT-4 was used to generate corresponding AI-written reviews, and the human- and AI-written reviews were compared on their objectives, inclusion and exclusion criteria, search strategy, risk-of-bias assessment, study selection, analysis, outcomes, findings, conclusions, accuracy of references, quality of writing, and supplemental materials. A descriptive analysis was conducted on these components.
Results:
Although some AI-written reviews did not explicitly use the population, intervention, comparison, outcome (PICO) framework, their objectives were very similar to those of the human-written reviews. The literature reviews generated by AI were detailed, but the AI study searches did not retrieve the same reference articles and differed markedly in yield: human-written reviews included a mean of 31.84 studies, compared with 20.57 studies in AI-written reviews. The inclusion and exclusion criteria, however, were broadly similar between the two groups. The AI-written reviews applied fewer risk-of-bias assessment tools than the human-written reviews and did not report the results of those assessments. Analytical results varied, with some AI-written reviews yielding similar findings and others showing marked differences. Although the AI-written reviews mentioned statistical methods such as meta-analysis, they never generated or reported specific estimates for any outcome, producing conclusions only.
Conclusion:
Although there were differences between reviews produced by AI and by humans, AI shows promise as the technology advances and as a useful tool for academia, capable of assisting with resource-intensive steps such as searching and data extraction, if not complete systematic reviews. The findings of this study add to the ongoing conversation about artificial intelligence's place in research and shed light on the potential uses of such AI tools in academia.