Language models for biological research: a primer.

Artificial Intelligence; Computational Biology / methods; Humans; Natural Language Processing; Programming Languages

Journal

Nature methods

ISSN: 1548-7105

Abbreviated title: Nat Methods

Country: United States

NLM ID: 101215604

Publication information

Publication date:
Aug 2024

History:

received: 20 Mar 2024

accepted: 18 Jun 2024

medline: 10 Aug 2024

pubmed: 10 Aug 2024

entrez: 9 Aug 2024

Status: ppublish

Abstract

Language models are playing an increasingly important role in many areas of artificial intelligence (AI) and computational biology. In this primer, we discuss the ways in which language models, both those based on natural language and those based on biological sequences, can be applied to biological research. This primer is primarily intended for biologists interested in using these cutting-edge AI technologies in their applications. We provide guidance on best practices and key resources for adapting language models for biology.

Identifiers

DOI: 10.1038/s41592-024-02354-y

PMID: 39122951

PII: 10.1038/s41592-024-02354-y

Publication types

Journal Article; Review

Languages

eng

Pagination

1422-1429

References

Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).

doi: 10.1038/s41591-023-02448-8 pubmed: 37460753

OpenAI et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2024).

Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023). This paper introduces ESM-2, a powerful protein language model, and ESMFold, a model that uses ESM-2 as a foundation to predict protein structure.

Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023). This paper introduces Geneformer, a single-cell language model trained on gene expression profiles of single-cell transcriptomes.

Vaswani, A. et al. In Proc. Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5998–6008 (Curran Associates, 2017). This paper introduces the transformer architecture, which powers all of the language models discussed in this paper and much of the field at large.

Jin, W., Yang, K., Barzilay, R. & Jaakkola, T. Learning multimodal graph-to-graph translation for molecule optimization. Int. Conf. Learn. Represent. (2019).

Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku (Anthropic, 2024).

Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).

doi: 10.1093/bioinformatics/btz682 pubmed: 31501885

Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://doi.org/10.48550/arXiv.2305.09617 (2023).

Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at https://doi.org/10.48550/arXiv.2311.16452 (2023).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 248:1–248:38 (2023).

doi: 10.1145/3571730

Chen, M. et al. Evaluating large language models trained on code. Preprint at https://doi.org/10.48550/arXiv.2107.03374 (2021).

Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).

doi: 10.1038/s42256-024-00832-8

Nguyen, E. et al. Sequence modeling and design from molecular to genome scale with Evo. Preprint at https://doi.org/10.1101/2024.02.27.582234 (2024).

Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

doi: 10.1073/pnas.2016239118 pubmed: 33876751 pmcid: 8053943

Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).

doi: 10.1038/s41467-021-22732-w pubmed: 33893299 pmcid: 8065141

Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).

doi: 10.1038/s41467-022-32007-7 pubmed: 35896542 pmcid: 9329459

Meier, J. et al. In Proc. Advances in Neural Information Processing Systems 34 (eds. Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. S. & Wortman Vaughan, J.) 29287–29303 (Curran Associates, 2021).

Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat. Genet. 55, 1512–1522 (2023).

doi: 10.1038/s41588-023-01465-0 pubmed: 37563329 pmcid: 10484790

Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).

doi: 10.1038/s41586-021-04043-8 pubmed: 34707284

Ruffolo, J. A. & Madani, A. Designing proteins with language models. Nat. Biotechnol. 42, 200–202 (2024).

doi: 10.1038/s41587-024-02123-4 pubmed: 38361067

Hsu, C., Fannjiang, C. & Listgarten, J. Generative models for protein structures and sequences. Nat. Biotechnol. 42, 196–199 (2024).

doi: 10.1038/s41587-023-02115-w pubmed: 38361069

McWhite, C. D., Armour-Garb, I. & Singh, M. Leveraging protein language models for accurate multiple sequence alignments. Genome Res. 33, 1145–1153 (2023).

Chu, S. K. S. & Siegel, J. B. Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset. Preprint at bioRxiv https://doi.org/10.1101/2023.11.19.567747 (2023).

Swanson, K., Chang, H. & Zou, J. In Proc. 17th Machine Learning in Computational Biology Meeting (eds. Knowles, D. A., Mostafavi, S. & Lee, S.-I.) 110–130 (PMLR, 2022).

Jagota, M. et al. Cross-protein transfer learning substantially improves disease variant prediction. Genome Biol. 24, 182 (2023).

doi: 10.1186/s13059-023-03024-6 pubmed: 37550700 pmcid: 10408151

Sledzieski, S. et al. Democratizing protein language models with parameter-efficient fine-tuning. Proc. Natl Acad. Sci. USA 121, e2405840121 (2024).

doi: 10.1073/pnas.2405840121 pubmed: 38900798 pmcid: 11214071

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

doi: 10.1038/s41586-021-03819-2 pubmed: 34265844 pmcid: 8371605

Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods https://doi.org/10.1038/s41592-024-02201-0 (2024).

Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual–language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).

doi: 10.1038/s41591-023-02504-3 pubmed: 37592105

Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, 3 (2024).

doi: 10.1056/AIoa2300138

Edwards, C. et al. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds. Goldberg, Y., Kozareva, Z. & Zhang, Y.) 375–413 (Association for Computational Linguistics, 2022).

Chen, Y. & Zou, J. GenePT: a simple but effective foundation model for genes and cells built from ChatGPT. Preprint at bioRxiv https://doi.org/10.1101/2023.10.16.562533 (2024).

Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).

doi: 10.1126/science.adg7492 pubmed: 37733863

Wang, Z. et al. LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction. Sci. Rep. 12, 6832 (2022).

doi: 10.1038/s41598-022-10775-y

Authors

Elana Simon (E)

Kyle Swanson (K)

James Zou (J)
