Abstract: A key challenge in genetic diagnosis is assessing the pathogenicity of splicing-related genetic mutations. Existing tools for predicting pathogenic splicing mutations are mostly based on traditional machine learning methods that rely heavily on manually extracted splicing features. Their predictive performance is therefore limited, and it is particularly poor for non-canonical splicing mutations. To address this, a deleterious splicing mutation prediction method based on bidirectional encoder representations from transformers (BERT) and a convolutional neural network (CNN), named BCsplice, is proposed. The BERT module in BCsplice extracts contextual information from sequences, while the CNN extracts local features; combined, the two modules learn the semantic information of sequences and predict the pathogenicity of splicing mutations. The impact of non-canonical splicing mutations often depends more heavily on deep semantic information in the sequence context. By using the CNN to combine and extract the multi-level semantic information produced by BERT, BCsplice obtains rich representations that aid in identifying non-canonical splicing mutations. Comparative experiments demonstrate the superior performance of BCsplice, with certain performance advantages in non-canonical splicing regions in particular. The method contributes to the identification of pathogenic splicing mutations and to clinical genetic diagnosis.
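The idea of passing multi-level BERT representations through a CNN can be illustrated with a minimal NumPy sketch. This is not the BCsplice implementation: random arrays stand in for the hidden states of several encoder layers, concatenating the layers along the feature axis is only one assumed way of combining them, and all function names, shapes, and weights are hypothetical.

```python
import numpy as np

def conv1d_relu_maxpool(x, kernels, bias):
    """Apply a 1D convolution over sequence positions, then ReLU and max-pooling.

    x:       (seq_len, channels) token representations
    kernels: (n_filters, width, channels) convolution filters
    bias:    (n_filters,)
    Returns a (n_filters,) vector of pooled local features.
    """
    n_filters, width, _ = kernels.shape
    seq_len = x.shape[0]
    # Sliding windows over the sequence: (n_windows, width, channels)
    windows = np.stack([x[i:i + width] for i in range(seq_len - width + 1)])
    # Each filter is a dot product with each window: (n_windows, n_filters)
    feature_maps = np.einsum("pwc,fwc->pf", windows, kernels) + bias
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU
    return feature_maps.max(axis=0)               # max over positions

def score_sequence(layer_states, kernels, bias, w_out, b_out):
    """Combine per-layer hidden states and map them to a probability-like score.

    layer_states: list of (seq_len, hidden) arrays, one per encoder layer.
    Assumption: layers are combined by concatenation along the feature axis.
    """
    x = np.concatenate(layer_states, axis=-1)      # (seq_len, n_layers * hidden)
    pooled = conv1d_relu_maxpool(x, kernels, bias)
    logit = pooled @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))            # sigmoid

# Toy inputs: random stand-ins for encoder outputs and untrained weights.
rng = np.random.default_rng(0)
seq_len, hidden, n_layers, n_filters, width = 20, 8, 3, 4, 5
layer_states = [rng.normal(size=(seq_len, hidden)) for _ in range(n_layers)]
kernels = rng.normal(scale=0.1, size=(n_filters, width, n_layers * hidden))
bias = np.zeros(n_filters)
w_out = rng.normal(scale=0.1, size=n_filters)
b_out = 0.0

prob = score_sequence(layer_states, kernels, bias, w_out, b_out)
print(float(prob))
```

In a real model the hidden states would come from a pre-trained encoder and the filters would be trained on labelled variants; the sketch only shows how local convolutional features can be pooled from stacked contextual representations.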