Leveraging protein language models for accurate multiple sequence alignments.

Sequence Alignment; Proteins / genetics; Amino Acid Sequence; Algorithms; Amino Acids; Language

Journal

Genome research

ISSN: 1549-5469

Abbreviated title: Genome Res

Country: United States

NLM ID: 9518021

Publication information

Publication date:
07 2023

History:

received: 06 01 2023

accepted: 29 06 2023

medline: 28 8 2023

pubmed: 7 7 2023

entrez: 6 7 2023

Status: ppublish

Abstract

Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
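The idea of building an alignment by clustering and ordering per-residue embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny 3-dimensional vectors below are made-up stand-ins for protein language model embeddings, clustering is simplified to reciprocal best hits by cosine similarity merged with union-find, and the function names and the `min_sim` threshold are assumptions introduced for illustration only.

```python
import math
from graphlib import TopologicalSorter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reciprocal_best_hits(emb_a, emb_b, min_sim):
    """Return (i, j) position pairs that are each other's best match."""
    best_ab = [max(range(len(emb_b)), key=lambda j: cosine(emb_a[i], emb_b[j]))
               for i in range(len(emb_a))]
    best_ba = [max(range(len(emb_a)), key=lambda i: cosine(emb_a[i], emb_b[j]))
               for j in range(len(emb_b))]
    return [(i, j) for i, j in enumerate(best_ab)
            if best_ba[j] == i and cosine(emb_a[i], emb_b[j]) >= min_sim]


def align_by_embedding_clusters(seqs, embs, min_sim=0.9):
    """Cluster residues by embedding similarity, then order clusters as columns."""
    # Union-find over residues, each identified as (sequence index, position).
    parent = {(s, p): (s, p) for s, seq in enumerate(seqs) for p in range(len(seq))}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(range(len(seqs)), 2):
        for i, j in reciprocal_best_hits(embs[a], embs[b], min_sim):
            parent[find((a, i))] = find((b, j))

    cluster_of = {node: find(node) for node in parent}

    # A column may hold at most one residue from each sequence.
    seen = set()
    for (s, _), root in cluster_of.items():
        if (root, s) in seen:
            raise ValueError("cluster contains two residues from one sequence")
        seen.add((root, s))

    # Each cluster must come after the cluster of the preceding residue.
    sorter = TopologicalSorter()
    for s, seq in enumerate(seqs):
        for p in range(len(seq)):
            sorter.add(cluster_of[(s, p)])
            if p > 0:
                sorter.add(cluster_of[(s, p)], cluster_of[(s, p - 1)])
    columns = list(sorter.static_order())  # raises CycleError if orders conflict
    col_index = {root: k for k, root in enumerate(columns)}

    rows = []
    for s, seq in enumerate(seqs):
        row = ["-"] * len(columns)
        for p, residue in enumerate(seq):
            row[col_index[cluster_of[(s, p)]]] = residue
        rows.append("".join(row))
    return rows


# Toy input: hypothetical embeddings, one vector per residue.
seqs = ["MKV", "MV", "MRV"]
embs = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    [(0.9, 0.1, 0.0), (0.1, 0.0, 0.9)],
    [(1.0, 0.1, 0.0), (0.0, 1.0, 0.1), (0.0, 0.1, 1.0)],
]
alignment = align_by_embedding_clusters(seqs, embs)
print("\n".join(alignment))
# MKV
# M-V
# MRV
```

Note that no guide tree, pairwise alignment, gap penalty, or substitution matrix appears in the sketch: gaps arise only where a sequence has no residue in a cluster. Real embeddings are much higher-dimensional and noisier, so a practical method would need to resolve conflicting clusters rather than raise an error.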

Identifiers

DOI: 10.1101/gr.277675.123

PMID: 37414576

PMC: PMC10538487

pii: gr.277675.123

Chemical substances

Proteins 0

Amino Acids 0

Publication types

Journal Article; Research Support, N.I.H., Extramural

Languages

eng

Pagination

1145-1153

Grants

Agency: NIGMS NIH HHS

ID: R01 GM076275

Country: United States

Authors

Claire D McWhite (CD)

Isabel Armour-Garb (I)

Mona Singh (M)
