Designing efficient randstrobes for sequence similarity analyses

Author full names: Karami, Moein; Mohammadi, Aryan Soltani; Martin, Marcel; Ekim, Baris; Shen, Wei; Guo, Lidong; Xu, Mengyang; Pibiri, Giulio Ermanno; Patro, Rob; Sahlin, Kristoffer

Author addresses: [Karami, Moein; Mohammadi, Aryan Soltani; Sahlin, Kristoffer] Stockholm Univ, Dept Math, Sci Life Lab, S-10691 Stockholm, Sweden; [Martin, Marcel] Stockholm Univ, Dept Biochem & Biophys, Sci Life Lab, Natl Bioinformat Infrastruct Sweden, SE-17121 Solna, Sweden; [Ekim, Baris] MIT, Comp Sci & Artificial Intelligence Lab CSAIL, Cambridge, MA 02139 USA; [Martin, Marcel] Broad Inst MIT & Harvard, Cambridge, MA 02142 USA; [Shen, Wei] Chongqing Med Univ, Inst Viral Hepatitis, Dept Infect Dis, Key Lab Mol Biol Infect Dis, Minist Educ, Affiliate, Chongqing 400010, Peoples R China; [Guo, Lidong] BGI Res, Qingdao 266555, Peoples R China; [Xu, Mengyang] BGI Res, Shenzhen 518083, Peoples R China; [Pibiri, Giulio Ermanno] Ca Foscari Univ Venice, Dept Environm Sci Informat & Stat, I-30172 Venice, Italy; [Pibiri, Giulio Ermanno] ISTI CNR, I-56124 Pisa, Italy; [Patro, Rob] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA; [Pibiri, Giulio Ermanno] Univ Maryland, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA

Corresponding author: Sahlin, K (corresponding author), Stockholm Univ, Dept Math, Sci Life Lab, S-10691 Stockholm, Sweden.

Source: BIOINFORMATICS

ESI subject category: BIOLOGY & BIOCHEMISTRY

WOS accession number: WOS:001206629000004

JCR quartile: Q1

Impact factor: 4.4

Year: 2024

Volume: 40

Issue: 4

Start page: 

End page: 

Document type: Article

Keywords: 

Abstract: Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, which has motivated alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), have been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy of these operators. Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes. Availability and implementation: All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.
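
Note: the abstract describes randstrobes only in words. As a rough illustration of the general idea, the sketch below builds order-2 randstrobes by pairing each k-mer with a second k-mer chosen pseudo-randomly from a downstream window. The hash function (`kmer_hash`), the XOR-based link operator, and the parameter names `k`, `w_min`, and `w_max` are assumptions made for this example; they are not taken from the record and do not represent the operators the authors recommend.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (illustrative choice only)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def randstrobes(seq: str, k: int, w_min: int, w_max: int) -> list[tuple[int, int]]:
    """Return (first_strobe_pos, second_strobe_pos) pairs for order-2 randstrobes.

    For each start position i, the second strobe is the k-mer in the window
    [i + w_min, i + w_max] that minimizes an assumed link operator,
    here hash(first) XOR hash(candidate).
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    seeds = []
    for i in range(len(hashes)):
        lo = i + w_min
        hi = min(i + w_max, len(hashes) - 1)  # window is truncated near the end
        if lo > hi:
            break  # no room left for a second strobe
        # Pick the candidate minimizing the link value (first minimum on ties).
        j = min(range(lo, hi + 1), key=lambda c: hashes[i] ^ hashes[c])
        seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    example = "ACGTACGTTGCAAGCT" * 2
    for first, second in randstrobes(example, k=4, w_min=5, w_max=10)[:5]:
        print(first, second, example[first:first + 4], example[second:second + 4])
```

Because the gap between the two strobes varies from seed to seed, a short insertion or deletion between them does not necessarily destroy the match, which is the property the abstract attributes to strobemers. Swapping the `key` function changes the link operator, which is the design choice the study evaluates for bias.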

Funding agencies: Swedish Research Council (SRC, Vetenskapsradet) [2021-04000]; Knut and Alice Wallenberg Foundation as part of the National Bioinformatics Infrastructure Sweden at SciLifeLab; National Natural Science Foundation of China [32100514]; DAIS-Ca' Foscari University of Venice within the IRIDE program

Funding text: Kristoffer Sahlin was supported by the Swedish Research Council (SRC, Vetenskapsradet) [2021-04000]. Marcel Martin was financially supported by the Knut and Alice Wallenberg Foundation as part of the National Bioinformatics Infrastructure Sweden at SciLifeLab. Mengyang Xu was supported by the National Natural Science Foundation of China [32100514]. Giulio Ermanno Pibiri was partially supported by DAIS-Ca' Foscari University of Venice within the IRIDE program.