A Transferable Machine Learning Framework for Predicting Transcriptional Responses of Genes Across Species.
Dinucleotide frequency
Gene annotation
Grasses
Machine learning
Random forest
Transfer learning
Journal
Methods in molecular biology (Clifton, N.J.)
ISSN: 1940-6029
Abbreviated title: Methods Mol Biol
Country: United States
NLM ID: 9214969
Publication information
Publication date: 2023
History:
medline: 11 9 2023
pubmed: 8 9 2023
entrez: 8 9 2023
Status: ppublish
Abstract
Leveraging existing resources in well-studied species to predict gene functions has the potential to rapidly expand understanding of annotated genes in other, less well-studied species with assembled genomes. However, orthology is not a reliable predictor of the transcriptional responses of genes to stress. Machine learning methods can quantitatively estimate expression patterns and gene functions using known annotations and collections of features describing each gene. In this chapter, we describe a supervised machine learning framework that predicts stress-responsive genes across species using only features derived from nucleotide sequences, illustrated with cold stress-responsive genes in different Panicoid grass species.
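The abstract does not spell out which sequence-derived features are used, but the keywords list dinucleotide frequency and random forest. As an illustration only, the following minimal sketch shows one way such a feature vector could be computed from a nucleotide sequence. The function name, the handling of ambiguous bases, and the toy sequence are assumptions made for this example and are not taken from the chapter.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# Fixed ordering of the 16 dinucleotides (AA, AC, ..., TT) so that every
# gene yields a feature vector with the same layout.
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def dinucleotide_frequencies(sequence):
    """Return the 16 overlapping-dinucleotide frequencies of a DNA sequence.

    Hypothetical helper for illustration; not the chapter's implementation.
    """
    seq = sequence.upper()
    # Count overlapping pairs. Pairs containing anything other than
    # A/C/G/T (for example N) are skipped, which is an assumption here.
    counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    total = sum(counts.values())
    if total == 0:
        # No valid dinucleotide: return an all-zero vector.
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]


# Toy sequence, invented for this example.
features = dinucleotide_frequencies("ACGTACGTNN")
```

Vectors like `features`, computed per gene, could then serve as inputs to a supervised classifier such as a random forest trained on genes labeled as stress-responsive or non-responsive in one species and applied to another. The training step is omitted here because the abstract gives no details about it.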
Identifiers
pubmed: 37682485
doi: 10.1007/978-1-0716-3354-0_21
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
361-379
Copyright information
© 2023. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.