Multi-resolution modulation-filtered cochleagram feature for LSTM-based dimensional emotion recognition from speech.
Dimensional emotion
Multi-resolution modulation-filtered cochleagram
Parallel long short-term memory network
Temporal modulation
Journal
Neural networks : the official journal of the International Neural Network Society
ISSN: 1879-2782
Titre abrégé: Neural Netw
Pays: United States
ID NLM: 8805018
Informations de publication
Date de publication:
Aug 2021
Aug 2021
Historique:
received:
07
07
2020
revised:
11
02
2021
accepted:
15
03
2021
pubmed:
11
4
2021
medline:
29
6
2021
entrez:
10
4
2021
Statut:
ppublish
Résumé
Continuous dimensional emotion recognition from speech helps robots or virtual agents capture the temporal dynamics of a speaker's emotional state in natural human-robot interactions. Temporal modulation cues obtained directly from the time-domain model of auditory perception can better reflect temporal dynamics than the acoustic features usually processed in the frequency domain. Feature extraction, which can reflect temporal dynamics of emotion from temporal modulation cues, is challenging because of the complexity and diversity of the auditory perception model. A recent neuroscientific study suggests that human brains derive multi-resolution representations through temporal modulation analysis. This study investigates multi-resolution representations of an auditory perception model and proposes a novel feature called multi-resolution modulation-filtered cochleagram (MMCG) for predicting valence and arousal values of emotional primitives. The MMCG is constructed by combining four modulation-filtered cochleagrams at different resolutions to capture various temporal and contextual modulation information. In addition, to model the multi-temporal dependencies of the MMCG, we designed a parallel long short-term memory (LSTM) architecture. The results of extensive experiments on the RECOLA and SEWA datasets demonstrate that MMCG provides the best recognition performance in both datasets among all evaluated features. The results also show that the parallel LSTM can build multi-temporal dependencies from the MMCG features, and the performance on valence and arousal prediction is better than that of a plain LSTM method.
Identifiants
pubmed: 33838592
pii: S0893-6080(21)00115-5
doi: 10.1016/j.neunet.2021.03.027
pii:
doi:
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
261-273Informations de copyright
Copyright © 2021. Published by Elsevier Ltd.
Déclaration de conflit d'intérêts
Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.