MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning

Author full names:"Yin, HuaChun; Tao, JingXin; Peng, Yuyang; Xiong, Ying; Li, Bo; Li, Song; Yang, Hui"

Author addresses:"[Yin, HuaChun; Peng, Yuyang; Li, Song; Yang, Hui] Army Med Univ, Xinqiao Hosp, Dept Neurosurg, Chongqing 400037, Peoples R China; [Yin, HuaChun; Tao, JingXin; Li, Bo] Chongqing Normal Univ, Coll Life Sci, Chongqing 401331, Peoples R China; [Yin, HuaChun; Xiong, Ying] Army Med Univ, Chongqing Key Lab Neurobiol, Dept Neurobiol, Chongqing 400038, Peoples R China; [Li, Song; Yang, Hui] Chongqing Inst Brain & Intelligence, Guangyang Bay Lab, Chongqing, Peoples R China"

Corresponding authors:"Li, B (corresponding author), Chongqing Normal Univ, Coll Life Sci, Chongqing 401331, Peoples R China.; Li, S; Yang, H (corresponding author), Chongqing Inst Brain & Intelligence, Guangyang Bay Lab, Chongqing, Peoples R China."

Source:COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL

ESI subject category:BIOLOGY & BIOCHEMISTRY

WOS accession number:WOS:001060424500001

JCR quartile:Q1

Impact factor:6

Year:2022

Volume:20

Issue: 

Start page:3783

End page:3795

Document type:Article

Keywords:Small sample size; Random sampling; Feature selection; Differentially expressed genes; Machine learning

Abstract:"In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into the molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), small sample sizes are frequently used in experiments. Therefore, methods to screen reliable and stable features are urgently needed for analyses with limited sample size. In this study, MSPJ, a new machine learning approach for identifying DEGs, was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method consists of three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and a permutation test. MSPJ was compared with ten classical methods on 94 simulated datasets and in large-scale benchmarking with 165 real datasets. The results showed that, among these methods, MSPJ had the best performance in most small gene expression datasets, especially those with sample size below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes. (c) 2022 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY-NC-ND license (http://creative-commons.org/licenses/by-nc-nd/4.0/)."
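
Of the three components named in the abstract, the permutation test is the simplest to illustrate in isolation. The sketch below is a generic two-sided permutation test on the difference in group means for a single gene. It is not the authors' MSPJ implementation: the function name `permutation_test`, its parameters, and the expression values are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Generic illustration only; names and defaults are assumptions,
    not taken from the MSPJ paper.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression values for one gene in two small groups.
control = [5.1, 4.9, 5.3, 5.0, 5.2]
disease = [6.8, 7.1, 6.9, 7.3, 7.0]
p_value = permutation_test(control, disease)
print(f"estimated p-value: {p_value:.4f}")
```

With only five samples per group there are just 252 distinct label assignments, so the smallest achievable p-value is bounded from below. This illustrates why power is limited in small datasets, which is the problem the abstract says MSPJ is designed to mitigate.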

Funding agencies:"Nursery Project of Army Medical University [2019R054]; Natural Science Foundation of Chongqing, China [CSTC2019JCYJ-MSXMX0527]; Open Fund of Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology, Yunnan University, Chongqing Technology Innovation and Application Development Special key Project [cstc2019jscx-dxwtBX0010]; Science and Technology Research Program of Chongqing Municipal Education Commission [KJQN202100538]"

Funding text:"This work was supported by grants from the Nursery Project of Army Medical University (No. 2019R054), Natural Science Foundation of Chongqing, China (Grant No. CSTC2019JCYJ-MSXMX0527), Open Fund of Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology, Yunnan University, Chongqing Technology Innovation and Application Development Special key Project (cstc2019jscx-dxwtBX0010), and Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJQN202100538)."