Evaluating Protein Transfer Learning with TAPE.
Journal
Advances in neural information processing systems
ISSN: 1049-5258
Abbreviated title: Adv Neural Inf Process Syst
Country: United States
NLM ID: 9607483
Publication information
Publication date:
Dec 2019
History:
entrez: 4 Jan 2021
pubmed: 1 Dec 2019
medline: 1 Dec 2019
Status:
ppublish
Abstract
Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
Publication types
Journal Article
Languages
eng
Pagination
9689-9701
Grants
Agency: NIGMS NIH HHS
ID: R01 GM094402
Country: United States