Discovery of potential biomarkers for lung cancer classification based on human proteome microarrays using Stochastic Gradient Boosting approach

Authors (full names):"Yao, Ning; Pan, Jianbo; Chen, Xicheng; Li, Pengpeng; Li, Yang; Wang, Zhenyan; Yao, Tianhua; Qian, Li; Yi, Dong; Wu, Yazhou"

Author addresses:"[Yao, Ning; Chen, Xicheng; Li, Pengpeng; Li, Yang; Wang, Zhenyan; Yao, Tianhua; Qian, Li; Yi, Dong; Wu, Yazhou] Army Med Univ, Coll Prevent Med, Dept Hlth Stat, 30 Gaotanyan St, Chongqing 400038, Peoples R China; [Yao, Ning] Chongqing Ctr Dis Control & Prevent, 8 Changjiang 2nd St, Chongqing 400042, Peoples R China; [Pan, Jianbo] Chongqing Med Univ, Inst Life Sci, Ctr Novel Target & Therapeut Intervent, Chongqing 400016, Peoples R China"

Corresponding authors:"Yi, D; Wu, YZ (corresponding author), Army Med Univ, Coll Prevent Med, Dept Hlth Stat, 30 Gaotanyan St, Chongqing 400038, Peoples R China."

Source:JOURNAL OF CANCER RESEARCH AND CLINICAL ONCOLOGY

ESI subject category:CLINICAL MEDICINE

WOS accession number:WOS:000934849300001

JCR quartile:Q3

Impact factor:2.7

Year:2023

Volume:

Issue:

Start page:

End page:

Document type:Article; Early Access

Keywords:Lung cancer; Classification; Biomarkers; Ensemble learning; Proteome microarrays

Abstract:"Purpose: Early identification of lung cancer (LC) will considerably facilitate the intervention and prevention of LC. The human proteome microarray approach can be used as a 'liquid biopsy' to diagnose LC and complement conventional diagnosis, but it requires advanced bioinformatics methods such as feature selection (FS) and refined machine learning models. Methods: A two-stage FS methodology combining Pearson's correlation (PC) with either a univariate filter (SBF) or recursive feature elimination (RFE) was used to reduce the redundancy of the original dataset. The Stochastic Gradient Boosting (SGB), Random Forest (RF), and Support Vector Machine (SVM) techniques were applied to build ensemble classifiers based on four subsets. The synthetic minority oversampling technique (SMOTE) was used in the preprocessing of imbalanced data. Results: The FS approach with SBF and RFE extracted 25 and 55 features, respectively, with 14 features shared between the two sets. All three ensemble models achieved high accuracy (ranging from 0.867 to 0.967) and sensitivity (0.917 to 1.00) on the test datasets, with SGB on the SBF subset outperforming the others. The SMOTE technique improved model performance in the training process. Three of the top selected candidate biomarkers (LGR4, CDC34, and GHRHR) are strongly suggested to play a role in lung tumorigenesis. Conclusion: A novel hybrid FS method with classical ensemble machine learning algorithms was used for the first time in the classification of protein microarray data. The parsimonious model constructed by the SGB algorithm with the appropriate FS and SMOTE approach performs well in the classification task, with higher sensitivity and specificity. Standardization and innovation of bioinformatics approaches for protein microarray analysis need further exploration and validation."
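
Illustrative note: the abstract names two preprocessing ideas, a Pearson-correlation filter to reduce feature redundancy and SMOTE to oversample the minority class. The sketch below shows one possible reading of each in plain Python. It is not the authors' implementation: the function names, the 0.9 correlation cutoff, the neighbour count k, and the toy data are all assumptions made for illustration, and the record does not state which software or parameter values were used.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature carries no correlation signal
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """Greedy redundancy filter (hypothetical helper).

    columns maps feature name -> list of values. A feature is kept only
    if its absolute correlation with every already-kept feature does not
    exceed the threshold (0.9 is an assumed value).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def smote_like(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling (hypothetical helper).

    Each synthetic point is drawn on the segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy data: feature "b" is an exact multiple of "a", so it is dropped.
features = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
selected = drop_correlated(features)
extra_samples = smote_like([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_new=5)
```

In a workflow like the one the abstract describes, a filter of this kind would run before the univariate filter or RFE stage, and oversampling would be applied to the training split only, so that synthetic samples do not leak into the test data.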

Funding agency:"National Natural Science Foundation of China [82173621, 81872716]"

Funding text:"This work was supported by the National Natural Science Foundation of China (No. 82173621, 81872716)."