Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

Full author names: Wei, Qiuhong; Yao, Zhengxiong; Cui, Ying; Wei, Bo; Jin, Zhezhen; Xu, Ximing

Author addresses: [Wei, Qiuhong; Xu, Ximing] Chongqing Med Univ, Childrens Hosp, Big Data Ctr Childrens Med Care, 136 Zhongshan 2nd Rd, Chongqing 400014, Peoples R China; [Yao, Zhengxiong] Chongqing Med Univ, Dept Neurol, Childrens Hosp, Chongqing, Peoples R China; [Cui, Ying] Stanford Univ, Sch Med, Dept Biomed Data Sci, Stanford, CA USA; [Wei, Bo] BeiGene USA Inc, Dept Global Stat & Data Sci, San Mateo, CA USA; [Jin, Zhezhen] Columbia Univ, Mailman Sch Publ Hlth, Dept Biostat, 722 West 168th St, New York, NY 10032 USA; [Wei, Qiuhong] Chongqing Med Univ, Children Nutr Res Ctr, Childrens Hosp, Chongqing, Peoples R China; [Wei, Qiuhong] Natl Clin Res Ctr Child Hlth & Disorders, Minist Educ, Key Lab Child Dev & Disorders, China Int Sci & Technol Cooperat Base Child Dev, Chongqing Key Lab Child Dev & Disorders, Chongqing, Peoples R China

Corresponding authors: Xu, XM (corresponding author), Chongqing Med Univ, Childrens Hosp, Big Data Ctr Childrens Med Care, 136 Zhongshan 2nd Rd, Chongqing 400014, Peoples R China; Jin, ZZ (corresponding author), Columbia Univ, Mailman Sch Publ Hlth, Dept Biostat, 722 West 168th St, New York, NY 10032 USA.

Source: JOURNAL OF BIOMEDICAL INFORMATICS

ESI subject category: COMPUTER SCIENCE

WOS accession number: WOS:001218826900001

JCR quartile: Q2

Impact factor: 4

Year: 2024

Volume: 151

Issue: 

Start page: 

End page: 

Document type: Article

Keywords: ChatGPT; Large language model; Medicine; Evaluation

Abstract: Objective: Large language models (LLMs) such as ChatGPT are increasingly being explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and to provide direction for future research. Methods: An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was "ChatGPT," without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and the performance of ChatGPT. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO, CRD42023456327. Results: A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56% (95% CI: 51%–60%, I² = 87%) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. According to our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, the version of ChatGPT, and inter-rater consistency. Conclusion: This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of study designs and insufficient reporting may affect the reliability of the results. Our proposed evaluation framework provides insights for future study design and transparent reporting of LLMs in responding to medical questions.
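The pooled accuracy reported above (56%, 95% CI 51%–60%, I² = 87%) is the kind of figure produced by a random-effects meta-analysis of proportions. The following is a minimal sketch of DerSimonian–Laird pooling with Cochran's Q and I², using made-up study counts (events, total) purely for illustration — these are not the 17 studies from the review, and the abstract does not state which transformation or estimator the authors used.

```python
import math

# Hypothetical per-study data: (correct answers, total questions).
# These numbers are illustrative only, not from the review.
studies = [(55, 100), (60, 120), (40, 90), (70, 110)]

# Per-study proportion and its variance (normal approximation, raw scale)
p = [e / n for e, n in studies]
v = [pi * (1 - pi) / n for pi, (e, n) in zip(p, studies)]
w = [1 / vi for vi in v]  # inverse-variance weights

# Fixed-effect pooled proportion (needed for Q)
p_fe = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)

# Cochran's Q and the I^2 heterogeneity statistic (as a percentage)
Q = sum(wi * (pi - p_fe) ** 2 for wi, pi in zip(w, p))
k = len(studies)
I2 = max(0.0, (Q - (k - 1)) / Q) * 100 if Q > 0 else 0.0

# DerSimonian-Laird between-study variance tau^2
C = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - (k - 1)) / C)

# Random-effects pooled proportion with a 95% confidence interval
w_re = [1 / (vi + tau2) for vi in v]
p_re = sum(wi * pi for wi, pi in zip(w_re, p)) / sum(w_re)
se_re = math.sqrt(1 / sum(w_re))
ci = (p_re - 1.96 * se_re, p_re + 1.96 * se_re)

print(f"pooled accuracy {p_re:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), I2 = {I2:.0f}%")
```

An I² near 87%, as reported in the abstract, indicates substantial between-study heterogeneity, which is consistent with the review's observation that question sources, prompting processes, and evaluation metrics varied across studies.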

Funding agency: 

Funding text: