Leveraging protein language models for accurate multiple sequence alignments.


Journal

Genome research
ISSN: 1549-5469
Abbreviated title: Genome Res
Country: United States
NLM ID: 9518021

Publication information

Publication date:
07 2023
History:
received: 06 01 2023
accepted: 29 06 2023
medline: 28 8 2023
pubmed: 7 7 2023
entrez: 6 7 2023
Status: ppublish

Abstract

Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
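
The idea of building an alignment by clustering and then ordering per-residue embeddings can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that embeddings have already been computed; here they are stand-in one-hot vectors built from hypothetical `site_ids` rather than output from a protein language model. The function name `align_by_embedding_clusters`, the greedy merge rule, the `threshold` parameter, and the ordering of clusters by mean fractional position are all simplifying assumptions made for illustration.

```python
import numpy as np

def align_by_embedding_clusters(seqs, embs, threshold=0.5):
    """Toy MSA: cluster residues by embedding similarity, then order the clusters."""
    # Flatten residues to (sequence index, position) pairs with unit-norm vectors.
    residues = [(s, i) for s, seq in enumerate(seqs) for i in range(len(seq))]
    vecs = np.vstack([e / np.linalg.norm(e, axis=1, keepdims=True) for e in embs])
    sim = vecs @ vecs.T  # cosine similarity between every pair of residues

    # Greedy clustering: visit cross-sequence pairs from most to least similar and
    # merge their clusters unless the result would hold two residues of one sequence.
    cluster_of = list(range(len(residues)))
    members = {k: [k] for k in range(len(residues))}
    pairs = [(sim[a, b], a, b)
             for a in range(len(residues)) for b in range(a + 1, len(residues))
             if residues[a][0] != residues[b][0] and sim[a, b] >= threshold]
    for _, a, b in sorted(pairs, key=lambda p: -p[0]):
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue
        seqs_a = {residues[k][0] for k in members[ca]}
        seqs_b = {residues[k][0] for k in members[cb]}
        if seqs_a & seqs_b:
            continue
        for k in members.pop(cb):
            cluster_of[k] = ca
            members[ca].append(k)

    # Order clusters by the mean fractional position of their residues
    # (a simplification; it does not guarantee a consistent order in general).
    def mean_position(cluster):
        return np.mean([residues[k][1] / max(len(seqs[residues[k][0]]) - 1, 1)
                        for k in cluster])
    columns = sorted(members.values(), key=mean_position)

    # Emit one alignment column per cluster, with '-' where a sequence is absent.
    rows = [["-"] * len(columns) for _ in seqs]
    for c, cluster in enumerate(columns):
        for k in cluster:
            s, i = residues[k]
            rows[s][c] = seqs[s][i]
    return ["".join(r) for r in rows]

# Hypothetical toy input: residues sharing a site id get identical embeddings,
# so K and R are placed in one column even though the letters differ.
seqs = ["MKV", "MV", "MRLV"]
site_ids = [[0, 1, 3], [0, 3], [0, 1, 2, 3]]
embs = [np.eye(4)[ids] for ids in site_ids]
print(align_by_embedding_clusters(seqs, embs))  # ['MK-V', 'M--V', 'MRLV']
```

Note that the sketch uses no guide tree, pairwise alignment, gap penalty, or substitution matrix: columns come only from which residues cluster together and how the clusters are ordered.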

Identifiers

pubmed: 37414576
pii: gr.277675.123
doi: 10.1101/gr.277675.123
pmc: PMC10538487

Chemical substances

Proteins 0
Amino Acids 0

Publication types

Journal Article
Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

1145-1153

Grants

Agency: NIGMS NIH HHS
ID: R01 GM076275
Country: United States

Copyright information

© 2023 McWhite et al.; Published by Cold Spring Harbor Laboratory Press.

Authors

Claire D McWhite (CD)

Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA; cmcwhite@princeton.edu mona@cs.princeton.edu.

Isabel Armour-Garb (I)

Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA.
Department of Computer Science, Princeton University, Princeton, New Jersey 08544, USA.

Mona Singh (M)

Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA; cmcwhite@princeton.edu mona@cs.princeton.edu.
Department of Computer Science, Princeton University, Princeton, New Jersey 08544, USA.
