Bridging auditory perception and natural language processing with semantically informed deep neural networks.

Acoustic-to-semantic transformation; Auditory perception; Cognitive neuroscience; Deep neural networks; Natural language processing; Semantic embeddings; Sound recognition

Journal

Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
09 09 2024
History:
received: 21 12 2023
accepted: 30 08 2024
medline: 10 9 2024
pubmed: 10 9 2024
entrez: 9 9 2024
Status: epublish

Abstract

Sound recognition is effortless for humans but poses a significant challenge for artificial hearing systems. Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have recently surpassed traditional machine learning in sound classification. However, current DNNs map sounds to labels using binary categorical variables, neglecting the semantic relations between labels. Cognitive neuroscience research suggests that human listeners exploit such semantic information in addition to acoustic cues. We therefore hypothesize that incorporating semantic information improves DNNs' sound recognition performance, emulating human behaviour. In our approach, sound recognition is framed as a regression problem, with CNNs trained to map spectrograms to continuous semantic representations from NLP models (Word2Vec, BERT, and CLAP text encoder). Two DNN types were trained: semDNN with continuous embeddings and catDNN with categorical labels, both on a dataset extracted from a collection of 388,211 sounds enriched with semantic descriptions. Evaluations across four external datasets confirmed the superiority of semantic labeling from semDNN compared to catDNN, preserving higher-level relations. Importantly, an analysis of human similarity ratings for natural sounds showed that semDNN approximated human listener behaviour better than catDNN, other DNNs, and NLP models. Our work contributes to understanding the role of semantics in sound recognition, bridging the gap between artificial systems and human auditory perception.
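
The contrast drawn in the abstract between categorical labels and continuous semantic targets can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional embeddings, the label names, the `nearest_label` helper, and the use of cosine similarity to decode a predicted vector are all assumptions made for illustration (actual Word2Vec, BERT, or CLAP embeddings have hundreds of dimensions, and the abstract does not state how predictions are decoded).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 3-d "semantic" embeddings, invented for this example.
semantic = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.0],
    "engine": [0.0, 0.1, 0.9],
}

# One-hot (binary categorical) targets for the same labels.
labels = list(semantic)
one_hot = {
    name: [1.0 if i == j else 0.0 for j in range(len(labels))]
    for i, name in enumerate(labels)
}

def nearest_label(pred, table):
    """Return the label whose target vector is most similar to pred."""
    return max(table, key=lambda name: cosine(pred, table[name]))

# Stand-in for a network output in the semantic space (assumed values).
predicted = [0.9, 0.12, 0.02]
best = nearest_label(predicted, semantic)

# One-hot targets treat every pair of distinct labels as equally unrelated,
# while the semantic targets keep "dog" closer to "cat" than to "engine".
sim_one_hot = cosine(one_hot["dog"], one_hot["cat"])
sim_dog_cat = cosine(semantic["dog"], semantic["cat"])
sim_dog_engine = cosine(semantic["dog"], semantic["engine"])
print(best, sim_one_hot, round(sim_dog_cat, 3), round(sim_dog_engine, 3))
```

With continuous targets, a prediction that misses the exact label can still land near a related one, which is one way to read the abstract's claim that semantic labeling preserves higher-level relations.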

Identifiers

pubmed: 39251659
doi: 10.1038/s41598-024-71693-9
pii: 10.1038/s41598-024-71693-9

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

20994

Grants

Agency: Agence Nationale de la Recherche
ID: ANR-21-CE37-0027-01
Agency: Nederlandse Organisatie voor Wetenschappelijk Onderzoek
ID: 406.20.GO.030

Copyright information

© 2024. The Author(s).

References

Bansal, A. & Garg, N. K. Environmental sound classification: A descriptive review of the literature. Intell. Syst. Appl. 16, 200115 (2022).
Fukushima, K. & Miyake, S. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets 267–285 (Springer, 1982).
doi: 10.1007/978-3-642-46466-9_18
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F. et al. CNN architectures for large-scale audio classification. In Proc. ICASSP 2017, pp. 131–135 (2017).
Huang, J. J. & Leanos, J. J. A. AclNet: Efficient end-to-end audio classification CNN. arXiv preprint arXiv:1811.06669 (2018).
Kong, Q. et al. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 28(1), 2880–2894. https://doi.org/10.1109/TASLP.2020.3030497 (2020) arXiv:1912.10211 .
doi: 10.1109/TASLP.2020.3030497
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M. & Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans (2017).
Grooby, E. et al. Real-time multi-level neonatal heart and lung sound quality assessment for telehealth applications. IEEE Access 10, 10934–10948 (2022).
doi: 10.1109/ACCESS.2022.3144355
Mariscal-Harana, J. et al. Audio-based aircraft detection system for safe RPAS BVLOS operations. Electronics 9(12), 2076 (2020).
doi: 10.3390/electronics9122076
Scheidwasser-Clow, N., Kegler, M., Beckmann, P. & Cernak, M. SERAB: A multi-lingual benchmark for speech emotion recognition. In ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 7697–7701 (IEEE, 2022).
doi: 10.1109/ICASSP43922.2022.9747348
Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. FSD50K: An open dataset of human-labeled sound events. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 829–852. https://doi.org/10.1109/TASLP.2021.3133208 . arXiv:2010.00475 (2022).
Jimenez, A., Elizalde, B. & Raj, B. Sound event classification using ontology-based neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, vol. 9 (2018).
Gregg, M. K. & Samuel, A. G. The importance of semantics in auditory representations. Atten. Percept. Psychophys. 71, 607–619 (2009).
doi: 10.3758/APP.71.3.607 pubmed: 19304650
Giordano, B. L., Esposito, M., Valente, G. & Formisano, E. Intermediate acoustic-to-semantic representations link behavioral and neural responses to natural sounds. Nat. Neurosci. 26, 664–672 (2023).
doi: 10.1038/s41593-023-01285-9 pubmed: 36928634 pmcid: 10076214
Giordano, B. L., McDonnell, J. & McAdams, S. Hearing living symbols and nonliving icons: Category-specificities in the cognitive processing of environmental sounds. Brain Cogn. 73, 7–19 (2010).
doi: 10.1016/j.bandc.2010.01.005 pubmed: 20188452
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. Clap: Learning audio concepts from natural language supervision. arXiv preprint arXiv:2206.04769 (2022).
SoundIdeasInc. SuperHardDriveCombo. https://www.sound-ideas.com/Product/28/Super-Hard-Drive-Combo [Online].
Xie, H., Räsänen, O. & Virtanen, T. Zero-shot audio classification with factored linear and nonlinear acoustic-semantic projections. arXiv. https://doi.org/10.48550/ARXIV.2011.12657 . https://arxiv.org/abs/2011.12657 (2020).
Heller, L. M., Elizalde, B., Raj, B. & Deshmukh, S. Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session (2023).
van der Maaten, L. & Hinton, G. Visualizing data using T-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning 448–456 (PMLR, 2015).
Wang, J., Zhou, F., Wen, S., Liu, X., & Lin, Y. Deep Metric Learning with Angular Loss (2017).
Pennington, J., Socher, R. & Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014).
Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017).
doi: 10.1162/tacl_a_00051
Géron, A. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow (O’Reilly Media Inc., 2022).
He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition (2015).
Inc., S. DigiEffects-dataset. https://www.sound-ideas.com/Product/130/Digiffects-Sound-Effects-Library [Online].
SoundIdeasInc. GenHD-dataset. https://www.sound-ideas.com/Product/27/The-General-HD-Sound-Effects-Collection [Online].
SoundIdeasInc. Holliwood-dataset. https://www.sound-ideas.com/Collection/3/2/0/The-Hollywood-Edge-Sound-Effects-Libraries [Online].
SoundIdeasInc. Mike-McDonough. https://www.sound-ideas.com/Product/919/Mike-McDonough-SFX-Collection-on-Hard-Drive [Online].
SoundIdeasInc. Seraphine. https://www.sound-ideas.com/Product/735/Frank-Serafine-Sound-Effects-Hard-Drive [Online].
SoundIdeasInc. SoundStorm-dataset. https://www.sound-ideas.com/Product/376/Soundstorm-Sound-Effects-Library [Online].
SoundIdeasInc. Ultimate-dataset. https://www.sound-ideas.com/Product/508/Sound-Ideas-Ultimate-Sound-Effects-Collection [Online].
Google: GoogleNewsPretrainedModel. https://code.google.com/archive/p/word2vec/ [Online] (2013).
Mehrer, J., Spoerer, C. J., Jones, E. C., Kriegeskorte, N. & Kietzmann, T. C. An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl. Acad. Sci. USA 118(8), 1–9. https://doi.org/10.1073/pnas.2011417118 (2021).
doi: 10.1073/pnas.2011417118
Murtagh, F. & Legendre, P. Ward’s hierarchical agglomerative clustering method: Which algorithms implement ward’s criterion?. J. Classif. 31, 274–295 (2014).
doi: 10.1007/s00357-014-9161-z
Piczak, K. J. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1015–1018 (2015).
Salamon, J., Jacoby, C. & Bello, J. P. A dataset and taxonomy for urban sound research. MM 2014—Proceedings of the 2014 ACM Conference on Multimedia (3), pp. 1041–1044. https://doi.org/10.1145/2647868.2655045 (2014).
CVSSP: MSOS-dataset. https://cvssp.org/projects/makingsenseofsounds/site/challenge/ [Online] (2013).
Lawson, C. L. & Hanson, R. J. Solving Least Squares Problems (SIAM, 1995).
doi: 10.1137/1.9781611971217
Inc., S. Super Hard Drive Combo. https://www.sound-ideas.com/Product/28/Super-Hard-Drive-Combo/ [Online] (2022).
Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V. & McDermott, J. H. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron 98, 630–644.e16 (2018).
doi: 10.1016/j.neuron.2018.03.044 pubmed: 29681533
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W. & Plumbley, M. D. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition (2020).
Drossos, K., Lipping, S. & Virtanen, T. Clotho: An audio captioning dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 736–740 (IEEE, 2020).
Giordano, B. L., de Miranda Azevedo, R., Plasencia-Calaña, Y., Formisano, E. & Dumontier, M. What do we mean with sound semantics, exactly? A survey of taxonomies and ontologies of everyday sounds. Front. Psychol. 13, 964209 (2022).
doi: 10.3389/fpsyg.2022.964209 pubmed: 36312201 pmcid: 9601315
Inc., S. SoundIdeasLicense. https://www.sound-ideas.com/Page/Sound-Ideas-End-User-License [Online] (2022).
Ladjal, S., Newson, A. & Pham, C.-H. A PCA-like Autoencoder (2019).
Pham, C.-H., Ladjal, S. & Newson, A. PCA-AE: Principal component analysis autoencoder for organising the latent space of generative networks. J. Math. Imaging Vis. 64(5), 569–585 (2022).
doi: 10.1007/s10851-022-01077-z

Authors

Michele Esposito (M)

Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands. m.esposito@maastrichtuniversity.nl.

Giancarlo Valente (G)

Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands.

Yenisel Plasencia-Calaña (Y)

BISS Institute, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands.

Michel Dumontier (M)

Institute of Data Science, Maastricht University, Maastricht, The Netherlands.

Bruno L Giordano (BL)

Institut des Neurosciences de La Timone, CNRS UMR 7289-Université Aix-Marseille, Marseille, France.

Elia Formisano (E)

Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands. e.formisano@maastrichtuniversity.nl.
BISS Institute, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands. e.formisano@maastrichtuniversity.nl.
