Research Assistant, Cohen Children's Medical Center, New Hyde Park, New York, United States
Background: The increased availability of artificial intelligence (AI) tools, such as the public chatbot ChatGPT, has piqued researchers' interest in their potential role in addressing health concerns. Just as people turn to Google for medical information, they may soon use AI tools for answers to their medical questions. It is therefore vital to understand the quality of the medical information ChatGPT generates, especially in domains where people often seek reassurance, such as parents worried about their child's health.

Objective: This study aims to compare the readability and accuracy of pediatrician- and ChatGPT-generated responses to commonly asked pediatric questions.

Design/Methods: General pediatrics questions and pediatrician responses (n = 25) were extracted from the American Academy of Pediatrics' (AAP) www.healthychildren.org. Questions were entered into ChatGPT-3.5 with the prompt "How would you respond to a parent asking the following question: [insert question]?" Two blinded board-certified pediatricians rated the medical correctness and completeness of both ChatGPT and pediatrician answers on a 5-point Likert scale. Readability was assessed using average word count, Flesch Reading Ease (FRE) score, and Flesch-Kincaid Grade Level (FKGL) (Table 1). Raters were also asked which response they preferred.

Results: A total of 50 responses (25 by ChatGPT; 25 by pediatricians) were included in the final analysis. The medical correctness of ChatGPT's responses (4.8 ± 0.58) did not differ from that of the AAP pediatrician responses (4.8 ± 0.50; t = 0.0, p = 1.0). Similarly, the completeness of ChatGPT's responses (4.8 ± 0.41) did not differ from that of the AAP pediatrician responses (4.56 ± 0.71; t = 1.44, p = 0.15). However, compared with pediatrician responses, ChatGPT responses had a significantly higher FKGL score (12.58 vs. 9.06; p < 0.0001) and a significantly lower FRE score (37.30 vs. 56.74; p < 0.0001). ChatGPT and pediatrician responses did not differ in average word count (386 vs. 449; p = 0.095). Blinded physician scorers generally preferred ChatGPT responses, choosing the chatbot's response 17 of 25 times.
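For reference, the FRE and FKGL metrics reported above are the standard Flesch readability measures, defined as

FRE = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}

FKGL = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59

Lower FRE and higher FKGL both indicate harder-to-read text; FKGL corresponds approximately to a U.S. school grade level, so a score of 12.58 reflects roughly a high-school-senior reading level versus a ninth-grade level for 9.06.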
Conclusion(s): While ChatGPT and pediatrician responses had similar medical correctness and completeness, they differed on several readability metrics. ChatGPT responses were written at a higher grade level than pediatrician responses and thus had lower reading ease. Although the blinded physicians generally preferred the ChatGPT responses, many parents or patients could struggle with the more advanced reading level of the answers ChatGPT provides.