Finding features - variable extraction strategies for dimensionality reduction and marker compounds identification in GC-IMS data.
Chemometrics
Food authenticity
Non-target screening
Python
VOC profiling
Journal
Food research international (Ottawa, Ont.)
ISSN: 1873-7145
Abbreviated title: Food Res Int
Country: Canada
NLM ID: 9210143
Publication information
Publication date:
11 2022
History:
received: 15 06 2022
revised: 24 07 2022
accepted: 17 08 2022
entrez: 4 10 2022
pubmed: 5 10 2022
medline: 6 10 2022
Status:
ppublish
Abstract
Gas chromatography hyphenated to ion mobility spectrometry (GC-IMS) is a powerful, two-dimensional separation and detection technique for volatile organic compounds (VOCs). Low detection limits, high selectivity and robust operation make it an ideal tool for non-target screening (NTS) approaches. Combined with multivariate data analysis, it has been successfully applied to several areas in food science, such as authenticity control and flavor profiling. The recorded raw data contain a large number of variables because of the instrument's high scan speed. Additionally, NTS approaches, by design, record more data than required. Therefore, reducing the number of variables is a key step in any machine learning pipeline to limit overfitting, long training times and model complexity. The aim of the study is to compare the two most widely used dimensionality reduction techniques, principal component analysis (PCA) and partial least squares (PLS), with regard to interpretability, as a tool to find marker compounds, and performance, as a preprocessing step for supervised learning. Both offer per-variable visualizations, which make results easy to interpret and retain a connection to the input data that can lead to the discovery of marker compounds. A GC-IMS dataset on the botanical origin of honey is used, and all formatting steps necessary to apply PCA and PLS to higher-dimensional data and obtain intuitive figures are explained. To evaluate effectiveness as a preprocessing step in a supervised pipeline, four supervised algorithms were fitted with PCA or PLS variable reduction. PLS proved to be the more effective step in a supervised workflow in terms of accuracy, while PCA is highly effective for revealing preprocessing weaknesses such as misalignments.
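The formatting step mentioned above (applying PCA and PLS to higher-dimensional data and mapping the result back to per-variable figures) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' code: the data are synthetic noise with one artificial peak per class, the array layout (sample, retention time, drift time) is a guess at a typical GC-IMS format, and the helper names `unfold`, `pca_fit` and `pls_fit` are hypothetical. PCA is computed by SVD of the mean-centered matrix, and PLS is a basic deflation variant on one-hot class labels.

```python
import numpy as np


def unfold(spectra):
    """Flatten (n_samples, n_rt, n_dt) spectra into (n_samples, n_rt * n_dt)."""
    return spectra.reshape(spectra.shape[0], -1)


def pca_fit(X, n_components):
    """PCA via SVD of the mean-centered matrix; returns scores and loadings."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components], Vt[:n_components]


def pls_fit(X, Y, n_components):
    """Basic PLS with deflation; returns scores and per-variable weights.

    Each weight vector is the leading left singular vector of X^T Y,
    computed on the deflated, mean-centered matrices.
    """
    Xk = X - X.mean(axis=0)
    Yk = Y - Y.mean(axis=0)
    scores, weights = [], []
    for _ in range(n_components):
        U, _, _ = np.linalg.svd(Xk.T @ Yk, full_matrices=False)
        w = U[:, 0]
        t = Xk @ w                      # sample scores for this component
        p = Xk.T @ t / (t @ t)          # X loadings
        q = Yk.T @ t / (t @ t)          # Y loadings
        Xk = Xk - np.outer(t, p)        # remove the explained part of X
        Yk = Yk - np.outer(t, q)        # remove the explained part of Y
        scores.append(t)
        weights.append(w)
    return np.column_stack(scores), np.vstack(weights)


# Synthetic stand-in for GC-IMS spectra (assumed layout, not real data).
rng = np.random.default_rng(0)
n_samples, n_rt, n_dt = 30, 20, 15
labels = np.repeat([0, 1, 2], 10)            # three hypothetical classes
peaks = [(5, 3), (9, 6), (13, 9)]            # one artificial marker per class
spectra = rng.normal(size=(n_samples, n_rt, n_dt))
for k, (rt, dt) in enumerate(peaks):
    spectra[labels == k, rt, dt] += 10.0

X = unfold(spectra)                          # one row per sample
Y = np.eye(3)[labels]                        # one-hot class matrix
pca_scores, pca_loadings = pca_fit(X, 2)
pls_scores, pls_weights = pls_fit(X, Y, 2)

# Refold the per-variable vectors into retention time x drift time maps.
pca_maps = pca_loadings.reshape(-1, n_rt, n_dt)
pls_maps = pls_weights.reshape(-1, n_rt, n_dt)
top_variable = np.unravel_index(np.abs(pls_maps[0]).argmax(), (n_rt, n_dt))
print("strongest PLS weight at (rt, dt) index:", top_variable)
```

Because each refolded map has the same shape as a single spectrum, it can be plotted as an image on the original axes, so a high-magnitude region points to a candidate marker signal. The score matrices are the reduced inputs that a downstream classifier would be fitted on.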
Identifiers
pubmed: 36192933
pii: S0963-9969(22)00837-7
doi: 10.1016/j.foodres.2022.111779
Chemical substances
Volatile Organic Compounds
0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
111779
Copyright information
Copyright © 2022 Elsevier Ltd. All rights reserved.
Conflict of interest statement
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.