Evolution of Transcript Abundance is Influenced by Indels in Protein Low Complexity Regions.
Approximate Bayesian calculation
Co-evolution
Low-complexity region
Protein abundance
Transcript abundance
Journal
Journal of molecular evolution
ISSN: 1432-1432
Abbreviated title: J Mol Evol
Country: Germany
NLM ID: 0360051
Publication information
Publication date: 14 Mar 2024
History:
received: 05 Oct 2023
accepted: 24 Jan 2024
medline: 15 Mar 2024
pubmed: 15 Mar 2024
entrez: 15 Mar 2024
Status: ahead of print
Abstract
Protein low complexity regions (LCRs) are compositionally biased amino acid sequences, many of which have significant evolutionary impacts on the proteins that contain them. They are mutationally unstable, experiencing higher rates of indels and substitutions than higher complexity regions. LCRs also impact the expression of their proteins, likely through multiple effects along the path from gene transcription, through translation, to eventual protein degradation. It has been observed that proteins which contain LCRs are associated with elevated transcript abundance (TAb), despite having lower protein abundance. We have gathered and integrated human data to investigate the co-evolution of TAb and LCRs through ancestral reconstructions and model inference using an approximate Bayesian calculation-based method. We observe that on short evolutionary timescales TAb evolution is significantly impacted by changes in LCR length, with insertions driving TAb down. In contrast, the observed data are best explained by indel rates in LCRs which are unaffected by shifts in TAb. Our work demonstrates a coupling between LCR and TAb evolution, and the utility of incorporating multiple responses into evolutionary analyses.
Identifiers
pubmed: 38485789
doi: 10.1007/s00239-024-10158-z
pii: 10.1007/s00239-024-10158-z
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Grants
Agency: Natural Sciences and Engineering Research Council of Canada
ID: RGPIN-202-05733
Agency: Natural Sciences and Engineering Research Council of Canada
ID: PGSD3-547476-2020
Copyright information
© 2024. The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.