Multi-resolution modulation-filtered cochleagram feature for LSTM-based dimensional emotion recognition from speech.

Cochlea / physiology; Cues; Emotions; Humans; Machine Learning; Models, Neurological; Speech Perception; Speech Recognition Software

Dimensional emotion; Multi-resolution modulation-filtered cochleagram; Parallel long short-term memory network; Temporal modulation

Journal

Neural networks : the official journal of the International Neural Network Society

ISSN: 1879-2782

Abbreviated title: Neural Netw

Country: United States

NLM ID: 8805018

Publication information

Publication date:
Aug 2021

History:

received: 7 Jul 2020

revised: 11 Feb 2021

accepted: 15 Mar 2021

pubmed: 11 Apr 2021

medline: 29 Jun 2021

entrez: 10 Apr 2021

Status: ppublish

Abstract

Continuous dimensional emotion recognition from speech helps robots and virtual agents capture the temporal dynamics of a speaker's emotional state in natural human-robot interactions. Temporal modulation cues obtained directly from a time-domain model of auditory perception can reflect these dynamics better than acoustic features, which are usually processed in the frequency domain. Extracting features that reflect the temporal dynamics of emotion from temporal modulation cues is challenging because of the complexity and diversity of the auditory perception model. A recent neuroscientific study suggests that the human brain derives multi-resolution representations through temporal modulation analysis. This study investigates multi-resolution representations of an auditory perception model and proposes a novel feature, the multi-resolution modulation-filtered cochleagram (MMCG), for predicting the valence and arousal values of emotional primitives. The MMCG is constructed by combining four modulation-filtered cochleagrams at different resolutions to capture various temporal and contextual modulation information. In addition, to model the multi-temporal dependencies of the MMCG, we designed a parallel long short-term memory (LSTM) architecture. Extensive experiments on the RECOLA and SEWA datasets show that the MMCG provides the best recognition performance among all evaluated features on both datasets. The results also show that the parallel LSTM can build multi-temporal dependencies from the MMCG features and that it predicts valence and arousal better than a plain LSTM.
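
The idea of viewing one time-frequency representation at four temporal resolutions can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not specify the modulation filters or the resolutions, so plain average pooling over time stands in for them here, and the names `pool_time` and `multi_resolution_views` and the window sizes (1, 2, 4, 8) are hypothetical choices made only for illustration.

```python
from typing import List, Sequence

# A feature map: rows are time frames, columns are channels.
Matrix = List[List[float]]


def pool_time(frames: Matrix, window: int) -> Matrix:
    """Average non-overlapping groups of `window` consecutive frames.

    A trailing group shorter than `window` is averaged over the frames it has.
    Average pooling is an illustrative stand-in, not the paper's filtering.
    """
    pooled: Matrix = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]
        n_channels = len(group[0])
        pooled.append(
            [sum(frame[c] for frame in group) / len(group) for c in range(n_channels)]
        )
    return pooled


def multi_resolution_views(
    frames: Matrix, windows: Sequence[int] = (1, 2, 4, 8)
) -> List[Matrix]:
    """Return one pooled view per window size (window sizes are assumed)."""
    return [pool_time(frames, w) for w in windows]


# Toy input: 8 frames, 2 channels.
frames = [[float(t), float(2 * t)] for t in range(8)]
views = multi_resolution_views(frames)
print([len(v) for v in views])
```

In a parallel architecture of the kind the abstract describes, each view would presumably feed its own recurrent branch before the branch outputs are merged; how the branches are actually built and combined is not stated in this record.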

Identifiers

DOI: 10.1016/j.neunet.2021.03.027

PMID: 33838592

PII: S0893-6080(21)00115-5

Publication types

Journal Article

Languages

eng

Pagination

261-273

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Zhichao Peng

Jianwu Dang

Masashi Unoki

Masato Akagi
