Evolution of Transcript Abundance is Influenced by Indels in Protein Low Complexity Regions.

Approximate Bayesian calculation; Co-evolution; Low-complexity region; Protein abundance; Transcript abundance

Journal

Journal of molecular evolution

ISSN: 1432-1432

Abbreviated title: J Mol Evol

Country: Germany

NLM ID: 0360051

Publication information

Publication date:
14 Mar 2024

History:

received: 05 10 2023

accepted: 24 01 2024

medline: 15 3 2024

pubmed: 15 3 2024

entrez: 15 3 2024

Status: aheadofprint

Abstract

Protein low complexity regions (LCRs) are compositionally biased amino acid sequences, many of which have significant evolutionary impacts on the proteins that contain them. They are mutationally unstable, experiencing higher rates of indels and substitutions than higher complexity regions. LCRs also affect the expression of their proteins, likely through multiple effects along the path from gene transcription, through translation, to eventual protein degradation. It has been observed that proteins which contain LCRs are associated with elevated transcript abundance (TAb), despite having lower protein abundance. We have gathered and integrated human data to investigate the co-evolution of TAb and LCRs through ancestral reconstructions and model inference using an approximate Bayesian calculation-based method. We observe that on short evolutionary timescales TAb evolution is significantly affected by changes in LCR length, with insertions driving TAb down. In contrast, the observed data are best explained by indel rates in LCRs which are unaffected by shifts in TAb. Our work demonstrates a coupling between LCR and TAb evolution, and the utility of incorporating multiple responses into evolutionary analyses.

Identifiers

DOI: 10.1007/s00239-024-10158-z

PMID: 38485789

pii: 10.1007/s00239-024-10158-z

Publication types

Journal Article

Languages

eng

Grants

Organization: Natural Sciences and Engineering Research Council of Canada

ID: RGPIN-202-05733

Organization: Natural Sciences and Engineering Research Council of Canada

ID: PGSD3-547476-2020
