Organizing the bacterial annotation space with amino acid sequence embeddings.
Bacteria
Function prediction
Machine learning
Protein ontology
Sequence embedding
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date: 23 Sep 2022
History:
received: 30 May 2022
accepted: 11 Aug 2022
entrez: 23 Sep 2022
pubmed: 24 Sep 2022
medline: 28 Sep 2022
Status: epublish
Abstract
BACKGROUND
Because the gap between the number of proteins being discovered and the number that are functionally characterized keeps widening, protein function inference remains a fundamental challenge in computational biology. Known protein annotations are currently organized in human-curated ontologies; however, these ontologies may not organize all possible protein functions accurately. Meanwhile, recent advances in natural language processing and machine learning have produced models that embed amino acid sequences as vectors in n-dimensional space. So far, these embeddings have primarily been used to classify protein sequences according to manually constructed protein classification schemes.
RESULTS
In this work, we describe the use of amino acid sequence embeddings as a systematic framework for studying protein ontologies. Using a sequence embedding, we show that the bacterial carbohydrate metabolism class within the SEED annotation system contains 48 clusters of embedded sequences even though this class contains 29 functional labels. Furthermore, by embedding Bacillus amino acid sequences with unknown functions, we show that these unknown sequences form clusters whose members are likely to have similar biological roles.
CONCLUSIONS
This study demonstrates that amino acid sequence embeddings may be a powerful tool for developing more robust ontologies for annotating protein sequence data. In addition, embeddings may be useful for clustering protein sequences with unknown functions and for selecting optimal candidate proteins to characterize experimentally.
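The comparison reported in the results (more clusters of embedded sequences than functional labels) can be illustrated with a minimal sketch. This record does not say which embedding model or clustering algorithm the authors used, so everything below is an assumption made for illustration: the 2-dimensional toy vectors stand in for real sequence embeddings, the labels "A" and "B" are hypothetical, and a simple threshold-based single-linkage clustering on cosine distance is swapped in for whatever method the study applied.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_embeddings(vectors, threshold):
    """Single-linkage clustering: link every pair closer than `threshold`.

    Returns one cluster id per vector, numbered in order of first appearance.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) < threshold:
                parent[find(i)] = find(j)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(vectors))]


def labels_per_cluster(cluster_ids, labels):
    """Map each cluster id to the set of functional labels found in it."""
    groups = defaultdict(set)
    for cluster_id, label in zip(cluster_ids, labels):
        groups[cluster_id].add(label)
    return dict(groups)


# Hypothetical toy embeddings: label "A" splits into two distinct groups.
embeddings = [
    (1.0, 0.0), (0.98, 0.05),   # label A, first group
    (0.0, 1.0), (0.05, 0.99),   # label A, second group
    (-1.0, 0.1), (-0.97, 0.0),  # label B
]
labels = ["A", "A", "A", "A", "B", "B"]

cluster_ids = cluster_embeddings(embeddings, threshold=0.05)
n_clusters = len(set(cluster_ids))
n_labels = len(set(labels))
print(f"{n_clusters} clusters for {n_labels} labels")
print(labels_per_cluster(cluster_ids, labels))
```

In this toy setting a single label is spread over two clusters, which is the kind of mismatch between embedding structure and a curated label set that the abstract describes. The threshold value is arbitrary and would need tuning on real embeddings.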
Identifiers
pubmed: 36151519
doi: 10.1186/s12859-022-04930-5
pii: 10.1186/s12859-022-04930-5
pmc: PMC9502642
Chemical substances
Proteins
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
385
Grants
Agency: NIDDK NIH HHS
ID: RC2 DK116713
Country: United States
Agency: Australian Research Council
ID: DP220102915
Agency: NIH HHS
ID: RC2DK116713
Country: United States
Copyright information
© 2022. The Author(s).