CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis.
Large-scale cluster analysis
Liquid chromatography
Mass spectrometry
Optimal transport
Proteomics
Wasserstein kernel
Journal
BMC Bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
12 Feb 2021
History:
received: 3 Jun 2020
accepted: 14 Jan 2021
entrez: 13 Feb 2021
pubmed: 14 Feb 2021
medline: 23 Feb 2021
Status:
epublish
Abstract
The clustering of data produced by liquid chromatography coupled to mass spectrometry (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so most state-of-the-art methods rely on single-pass, linkage-based algorithms. We propose a clustering algorithm that optimizes the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, the Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters that differ from those resulting from state-of-the-art methods, while achieving similar performance. This highlights the complementarity of algorithms built on different principles for extracting the most from complex LC-MS data.
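To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of an optimal-transport kernel between elution profiles and of a Nyström feature map built from it. It is not the CHICKN implementation: the function names (`wasserstein1_1d`, `wasserstein_kernel`, `nystrom_features`), the Laplacian form exp(-gamma * W1), the `gamma` parameter and the assumption that all profiles are sampled on the same uniform time grid are illustrative choices, not details taken from the record.

```python
import numpy as np

def wasserstein1_1d(p, q, dt=1.0):
    """W1 distance between two non-negative profiles on the same uniform grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalise each profile to unit mass
    q = q / q.sum()
    # In one dimension, W1 is the L1 distance between the cumulative sums.
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dt)

def wasserstein_kernel(p, q, gamma=1.0):
    """Assumed Laplacian-type kernel: exp(-gamma * W1(p, q))."""
    return float(np.exp(-gamma * wasserstein1_1d(p, q)))

def nystrom_features(profiles, landmark_idx, gamma=1.0):
    """Map each profile to a vector whose dot products approximate the kernel."""
    landmarks = [profiles[i] for i in landmark_idx]
    # C: kernel values between every profile and every landmark, shape (n, m).
    C = np.array([[wasserstein_kernel(x, l, gamma) for l in landmarks]
                  for x in profiles])
    W = C[landmark_idx, :]            # landmark-by-landmark block, shape (m, m)
    vals, vecs = np.linalg.eigh(W)    # W is symmetric
    keep = vals > 1e-10               # drop numerically null directions
    W_inv_sqrt = vecs[:, keep] / np.sqrt(vals[keep])
    return C @ W_inv_sqrt             # F such that F @ F.T approximates K

# Toy usage: four single-bin "peaks" at different retention times.
profiles = [np.eye(6)[i] for i in (0, 1, 3, 5)]
F = nystrom_features(profiles, landmark_idx=[0, 1, 2, 3])
```

Rows of `F` can then be passed to an ordinary k-means routine, which is one common way to approximate kernel k-means without forming the full kernel matrix. Using fewer landmarks than profiles trades accuracy for speed.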
Identifiers
pubmed: 33579189
doi: 10.1186/s12859-021-03969-0
pii: 10.1186/s12859-021-03969-0
pmc: PMC7881590
Chemical substances
Peptides
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
68
Grants
Agency: ProFi project
ID: ANR-10-INBS-08
Agency: GRAL project
ID: ANR-10-LABX-49-01
Agency: DATA@UGA and SYMER projects
ID: ANR-15-IDEX-02
Agency: MIAI @ Grenoble Alpes
ID: ANR-19-P3IA-0003