In the twilight zone of protein sequence homology: do protein language models learn protein structure?


Journal

Bioinformatics advances
ISSN: 2635-0041
Abbreviated title: Bioinform Adv
Country: England
NLM ID: 9918282081306676

Publication information

Publication date:
2024
History:
received: 01 04 2024
revised: 01 08 2024
accepted: 12 08 2024
medline: 26 8 2024
pubmed: 26 8 2024
entrez: 26 8 2024
Status: epublish

Abstract

Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.

Identifiers

pubmed: 39183802
doi: 10.1093/bioadv/vbae119
pii: vbae119
pmc: PMC11344590

Publication types

Journal Article

Languages

eng

Pagination

vbae119

Copyright information

© The Author(s) 2024. Published by Oxford University Press.

Conflict of interest statement

None declared.

Authors

Anowarul Kabir (A)

Department of Computer Science, George Mason University, Fairfax, VA 22030, United States.

Asher Moldwin (A)

Department of Computer Science, George Mason University, Fairfax, VA 22030, United States.

Yana Bromberg (Y)

Department of Computer Science, Emory University, Atlanta, GA 30307, United States.

Amarda Shehu (A)

Department of Computer Science, George Mason University, Fairfax, VA 22030, United States.

MeSH classifications