Evaluation of input data modality choices on functional gene embeddings.
Journal
NAR genomics and bioinformatics
ISSN: 2631-9268
Abbreviated title: NAR Genom Bioinform
Country: England
NLM ID: 101756213
Publication information
Publication date:
Dec 2023
History:
received: 2023-05-16
revised: 2023-09-07
accepted: 2023-09-28
medline: 2023-11-09
pubmed: 2023-11-09
entrez: 2023-11-09
Status:
epublish
Abstract
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly-studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
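As a rough illustration of what "off-the-shelf predictors trained on precomputed embeddings" can look like, the sketch below fits a plain logistic regression on toy gene embeddings to predict membership in a gene list. Everything in it is an assumption made for illustration: the embeddings and labels are synthetic, the helper names (`make_toy_embeddings`, `fit_logistic`, `predict`) are hypothetical, and the model is not the predictor or the evaluation protocol used in the study.

```python
import math
import random


def make_toy_embeddings(n_genes=200, dim=4, seed=0):
    """Synthetic stand-in for precomputed gene embeddings (assumption, not real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_genes):
        label = i % 2  # 1 = gene belongs to a hypothetical curated gene list
        centre = 1.0 if label else -1.0
        X.append([rng.gauss(centre, 0.5) for _ in range(dim)])
        y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the logistic loss; returns weights and bias."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, X):
    """Label a gene 1 when its predicted probability is at least 0.5."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]


# Train on the first 150 genes and evaluate on the remaining 50.
X, y = make_toy_embeddings()
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

w, b = fit_logistic(X_train, y_train)
pred = predict(w, b, X_test)
accuracy = sum(p == t for p, t in zip(pred, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With real embeddings, the rows of `X` would be the precomputed vectors for each gene and `y` would mark genes on the list of interest. A random split such as this one does not address the study-bias issue the abstract raises, so any score obtained this way should be read with that caveat in mind.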
Identifiers
pubmed: 37942285
doi: 10.1093/nargab/lqad095
pii: lqad095
pmc: PMC10629286
Databases
figshare
10.6084/m9.figshare.13487277
Publication types
Journal Article
Languages
eng
Pagination
lqad095
Copyright information
© The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.