Potential and Limitations of ChatGPT 3.5 and 4.0 as a Source of COVID-19 Information: Comprehensive Comparative Analysis of Generative and Authoritative Information

Authors: Wang, Guoyong; Gao, Kai; Liu, Qianyang; Wu, Yuxin; Zhang, Kaijun; Zhou, Wei; Guo, Chunbao

Author Addresses: [Wang, Guoyong; Wu, Yuxin; Zhang, Kaijun] Chongqing Med Univ, Childrens Hosp, Chongqing, Peoples R China; [Wang, Guoyong; Liu, Qianyang; Zhou, Wei; Guo, Chunbao] Chongqing Med Univ, Women & Childrens Hosp, Chongqing, Peoples R China; [Gao, Kai] Guangzhou Med Univ, Guangzhou Women & Childrens Med Ctr, Guangzhou, Peoples R China; [Guo, Chunbao] Chongqing Med Univ, Affiliated Hosp 2, Dept Neurosurg, Women & Childrens Hosp, 120 Longshan Rd, Longshan St, Chongqing 400010, Peoples R China

Corresponding Author: Guo, CB (corresponding author), Chongqing Med Univ, Affiliated Hosp 2, Dept Neurosurg, Women & Childrens Hosp, 120 Longshan Rd, Longshan St, Chongqing 400010, Peoples R China.

Source: JOURNAL OF MEDICAL INTERNET RESEARCH

ESI Subject Category: CLINICAL MEDICINE

WOS Number: WOS:001126859300003

JCR Quartile: Q1

Impact Factor: 7.4

Year: 2023

Volume: 25

Issue: 

Start Page: 

End Page: 

Document Type: Article

Keywords: ChatGPT 3.5; ChatGPT 4.0; artificial intelligence; AI; COVID-19; pandemic; public health; information retrieval

Abstract:
Background: The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has necessitated reliable and authoritative information for public guidance. The World Health Organization (WHO) has been a primary source of such information, disseminating it through a question-and-answer format on its official website. Concurrently, ChatGPT 3.5 and 4.0, deep learning-based natural language generation systems, have shown potential in generating diverse text types based on user input.
Objective: This study evaluates the accuracy of COVID-19 information generated by ChatGPT 3.5 and 4.0, assessing their potential as a supplementary public information source during the pandemic.
Methods: We extracted 487 COVID-19-related questions from the WHO's official website and used ChatGPT 3.5 and 4.0 to generate corresponding answers. These generated answers were then compared against the official WHO responses for evaluation. Two clinical experts scored the generated answers on a scale of 0-5 across 4 dimensions (accuracy, comprehensiveness, relevance, and clarity), with higher scores indicating better performance in each dimension. The WHO responses served as the reference for this assessment. Additionally, we used the BERT (Bidirectional Encoder Representations from Transformers) model to generate similarity scores (0-1) between the generated and official answers, providing a dual validation mechanism.
Results: The mean (SD) scores for ChatGPT 3.5-generated answers were 3.47 (0.725) for accuracy, 3.89 (0.719) for comprehensiveness, 4.09 (0.787) for relevance, and 3.49 (0.809) for clarity. For ChatGPT 4.0, the mean (SD) scores were 4.15 (0.780), 4.47 (0.641), 4.56 (0.600), and 4.09 (0.698), respectively. All differences were statistically significant (P<.001), with ChatGPT 4.0 outperforming ChatGPT 3.5. The BERT model verification showed mean (SD) similarity scores of 0.83 (0.07) for ChatGPT 3.5 and 0.85 (0.07) for ChatGPT 4.0 compared with the official WHO answers.
Conclusions: ChatGPT 3.5 and 4.0 can generate accurate and relevant COVID-19 information to a certain extent. However, compared with official WHO responses, gaps and deficiencies exist. Thus, users of ChatGPT 3.5 and 4.0 should also consult other reliable information sources to mitigate potential misinformation risks. Notably, ChatGPT 4.0 outperformed ChatGPT 3.5 across all evaluated dimensions, a finding corroborated by BERT model validation.
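The abstract does not spell out how the BERT similarity scores in the 0-1 range were computed. A common approach is to encode each generated answer and each official WHO answer with a BERT sentence encoder and compare the two embedding vectors by cosine similarity, rescaled to [0, 1]. The sketch below assumes the embeddings have already been produced by such an encoder; the function names and the linear rescaling are illustrative assumptions, not the paper's documented method.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_score(generated_emb, official_emb):
    """Map cosine similarity from [-1, 1] onto a [0, 1] score,
    matching the 0-1 range reported in the abstract."""
    return (cosine_similarity(generated_emb, official_emb) + 1) / 2
```

In practice the embeddings would come from a pretrained BERT model (e.g. mean-pooled token representations); identical answers score 1.0 and orthogonal embeddings score 0.5 under this rescaling.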

Funding Agencies: National Natural Science Foundation of China [30973440, 30770950]; Ministry of Key Laboratory of Child Development and Disorders; Chongqing Natural Science Foundation [CSTB2022NSCQ-MSX0819, YBRP-2021XX]; [cstc2020jcyj-msxmX0326]

基金资助正文:"The authors are deeply indebted to the advancements in machine learning and artificial intelligence for bolstering the methodological framework of this study. Specifically, the authors used the ChatGPT 3.5 and 4.0 language models to autonomously generate the questions that served as the cornerstone of our evaluation metrics. The generated text and prompt words from these models can be found in Multimedia Appendices 2 and 5, respectively. Concurrently, we used bidirectional encoder representations from transformers (BERT) algorithms for the quantitative evaluation of text quality. Detailed metrics, including BERT scores, are available in Multimedia Appendix 5. This computational approach underwent rigorous statistical scrutiny, which was instrumental in enhancing both the analytical rigor and methodological precision of our research. The ""nonhuman assistance"" provided by these advanced algorithms was indispensable in elevating the academic quality of our study. The study received funding from several sources. The National Natural Science Foundation of China (grants 30973440 and 30770950) supported the data collection, analysis, and interpretation. The Ministry of Key Laboratory of Child Development and Disorders provided funding through the Youth Basic Research Project (grant YBRP-2021XX) . Additionally, the preparation of the paper was funded by key projects of the Chongqing Natural Science Foundation, specifically grants cstc2020jcyj-msxmX0326 and CSTB2022NSCQ-MSX0819. The funding agency paid for the scholarships of the students involved in the research."