GAN-based data augmentation for transcriptomics: survey and comparative assessment.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
30 06 2023
History:
medline: 3 7 2023
pubmed: 30 6 2023
entrez: 30 6 2023
Status: ppublish

Abstract

Transcriptomics data are becoming more accessible thanks to high-throughput and less costly sequencing methods. However, data scarcity prevents deep learning models from reaching their full predictive power for phenotype prediction. Artificially enlarging the training set, namely data augmentation, has been suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes. This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields accuracies of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved accuracies of 98% and 94% when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and better generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly. All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.
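The augmentation setup described in the abstract (a small real training set enlarged with label-conditioned synthetic samples) can be illustrated with a minimal sketch. This is not the authors' implementation: a per-class Gaussian sampler stands in for a trained conditional GAN generator, the data are random toy values rather than TCGA profiles, and all function names (`fit_class_sampler`, `generate`, `augment`) are hypothetical.

```python
import random
from statistics import mean, stdev


def fit_class_sampler(samples, labels):
    """Fit one (mean, std) pair per gene and per class.

    Assumption: this simple Gaussian model is only a placeholder for a
    trained conditional GAN generator.
    """
    stats = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        # zip(*rows) iterates over genes (columns) for this class.
        stats[label] = [(mean(col), stdev(col)) for col in zip(*rows)]
    return stats


def generate(stats, label, rng):
    """Draw one synthetic expression profile conditioned on a class label."""
    return [rng.gauss(mu, sigma) for mu, sigma in stats[label]]


def augment(samples, labels, n_synthetic, seed=0):
    """Return the real training set followed by n_synthetic generated samples."""
    rng = random.Random(seed)
    stats = fit_class_sampler(samples, labels)
    classes = sorted(stats)
    aug_x, aug_y = list(samples), list(labels)
    for i in range(n_synthetic):
        label = classes[i % len(classes)]  # cycle labels to keep classes balanced
        aug_x.append(generate(stats, label, rng))
        aug_y.append(label)
    return aug_x, aug_y


# Toy data: 50 "real" samples, 5 genes, 2 classes (values are random, not TCGA).
toy_rng = random.Random(42)
real_y = [i % 2 for i in range(50)]
real_x = [[toy_rng.gauss(2.0 * y, 1.0) for _ in range(5)] for y in real_y]

# Mirror the 50 real + 1000 augmented samples setting from the abstract.
aug_x, aug_y = augment(real_x, real_y, n_synthetic=1000)
print(len(aug_x), len(aug_y))
```

A classifier would then be trained on `aug_x` and `aug_y` and evaluated on held-out real samples only, so that reported accuracy reflects performance on real data rather than on generated profiles.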

Identifiers

pubmed: 37387181
pii: 7210506
doi: 10.1093/bioinformatics/btad239
pmc: PMC10311334

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

i111-i120

Copyright information

© The Author(s) 2023. Published by Oxford University Press.

Authors

Alice Lacan (A)

IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France.

Michèle Sebag (M)

TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France.

Blaise Hanczar (B)

IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France.
