CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis.

MeSH terms: Algorithms; Chromatography, Liquid; Cluster Analysis; Data Compression; Mass Spectrometry; Peptides / chemistry; Proteomics / methods

Keywords: Large-scale cluster analysis; Liquid chromatography; Mass spectrometry; Optimal transport; Proteomics; Wasserstein kernel

Journal

BMC bioinformatics

ISSN: 1471-2105

Abbreviated title: BMC Bioinformatics

Country: England

NLM ID: 100965194

Publication information

Publication date:
12 Feb 2021

History:

received: 03 06 2020

accepted: 14 01 2021

entrez: 13 2 2021

pubmed: 14 2 2021

medline: 23 2 2021

Status: epublish

Abstract

The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most state-of-the-art methods rely on single-pass linkage-based algorithms. We propose a clustering algorithm that optimizes the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, the Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters that differ from those resulting from state-of-the-art methods, while achieving similar performance. This highlights the complementarity of algorithms based on different principles for extracting the best from complex LC-MS data.
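To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of an optimal-transport kernel between elution profiles combined with a Nyström feature map. It is not the CHICKN implementation: the kernel form exp(-gamma * W1), the function names (`wasserstein1`, `wasserstein_kernel`, `nystrom_features`), the uniform time grid, and the synthetic Gaussian profiles are all assumptions made for illustration.

```python
import numpy as np

def wasserstein1(p, q, dt=1.0):
    """1-D Wasserstein-1 distance between two non-negative profiles
    sampled on the same uniform time grid (assumed spacing dt)."""
    p = p / p.sum()
    q = q / q.sum()
    # In 1-D, W1 is the L1 distance between the cumulative distributions.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum() * dt

def wasserstein_kernel(A, B, gamma=1.0, dt=1.0):
    """Assumed Laplacian-type kernel: K[i, j] = exp(-gamma * W1(A[i], B[j]))."""
    cA = np.cumsum(A / A.sum(axis=1, keepdims=True), axis=1)
    cB = np.cumsum(B / B.sum(axis=1, keepdims=True), axis=1)
    d = np.abs(cA[:, None, :] - cB[None, :, :]).sum(axis=2) * dt
    return np.exp(-gamma * d)

def nystrom_features(X, landmarks, gamma=1.0, eps=1e-10):
    """Nystrom feature map Phi such that Phi @ Phi.T approximates K(X, X)."""
    C = wasserstein_kernel(X, landmarks, gamma)          # n x m
    W = wasserstein_kernel(landmarks, landmarks, gamma)  # m x m
    vals, vecs = np.linalg.eigh(W)
    keep = vals > eps  # drop numerically null directions (pseudo-inverse)
    return C @ (vecs[:, keep] / np.sqrt(vals[keep]))

def gaussian_profile(center, width, grid):
    """Synthetic stand-in for a chromatographic elution profile."""
    return np.exp(-0.5 * ((grid - center) / width) ** 2)

grid = np.arange(50, dtype=float)
rng = np.random.default_rng(0)
centers = rng.uniform(10, 40, size=30)
X = np.stack([gaussian_profile(c, 2.0, grid) for c in centers])

# Use 10 randomly chosen profiles as landmarks (hypothetical choice).
landmarks = X[rng.choice(len(X), size=10, replace=False)]
Phi = nystrom_features(X, landmarks, gamma=0.1)
K_approx = Phi @ Phi.T  # low-rank approximation of the full kernel matrix
```

Running ordinary k-means on the rows of `Phi` then approximates kernel k-means without ever forming the full n x n kernel matrix, which is the general motivation for a Nyström step in large-scale settings.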

Identifiers

DOI: 10.1186/s12859-021-03969-0

PMID: 33579189

PMCID: PMC7881590

PII: 10.1186/s12859-021-03969-0

Chemical substances

Peptides 0

Publication types

Journal Article

Languages

eng

Grants

Agency: ProFi project

ID: ANR-10-INBS-08

Agency: GRAL project

ID: ANR-10-LABX-49-01

Agency: DATA@UGA and SYMER projects

ID: ANR-15-IDEX-02

Agency: MIAI @ Grenoble Alpes

ID: ANR-19-P3IA-0003

Authors

Olga Permiakova (O)

Romain Guibert (R)

Alexandra Kraut (A)

Thomas Fortin (T)

Anne-Marie Hesse (AM)

Thomas Burger (T)
