EARSHOT: A Minimal Neural Network Model of Incremental Human Speech Recognition.


Journal

Cognitive Science
ISSN: 1551-6709
Abbreviated title: Cogn Sci
Country: United States
NLM ID: 7708195

Publication information

Publication date: April 2020
History:
received: 2019-08-27
revised: 2019-12-11
accepted: 2020-02-05
entrez: 2020-04-11
pubmed: 2020-04-11
medline: 2021-08-28
Status: ppublish

Abstract

Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), human listeners experience phonetic constancy and typically perceive what a speaker intends. Most models of human speech recognition (HSR) have side-stepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, carefully engineered deep learning networks allow robust, real-world automatic speech recognition (ASR). However, the complexities of deep learning architectures and training regimens make it difficult to use them to provide direct insights into mechanisms that may support HSR. In this brief article, we report preliminary results from a two-layer network that borrows one element from ASR, long short-term memory nodes, which provide dynamic memory for a range of temporal spans. This allows the model to learn to map real speech from multiple talkers to semantic targets with high accuracy, with human-like timecourse of lexical access and phonological competition. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of HSR.
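The architecture the abstract describes (spectral input frames passed through a single recurrent layer of long short-term memory nodes, mapped at each time step to a semantic output pattern) can be sketched in a few dozen lines. This is a minimal, illustrative numpy forward pass only, not the authors' implementation: the layer sizes, input encoding, and the sigmoid semantic readout are assumptions for the sketch, and no training is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gate blocks ordered [input, forget, cell, output]."""
    z = W @ x + U @ h + b          # all four gate pre-activations at once
    n = h.shape[0]
    i = sigmoid(z[:n])             # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    g = np.tanh(z[2 * n:3 * n])    # candidate cell update
    o = sigmoid(z[3 * n:])         # output gate
    c_new = f * c + i * g          # dynamic memory over a range of spans
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def earshot_like_forward(spectrogram, params):
    """Map each spectral frame to a semantic activation vector, incrementally."""
    W, U, b, V, d = params
    n_hidden = U.shape[1]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x in spectrogram:                      # one frame at a time
        h, c = lstm_step(x, h, c, W, U, b)
        outputs.append(sigmoid(V @ h + d))     # semantic target activation
    return np.array(outputs)

# Hypothetical sizes: 256 spectral channels, 512 LSTM units, 300-dim semantics.
rng = np.random.default_rng(0)
n_freq, n_hidden, n_sem, n_frames = 256, 512, 300, 10
params = (rng.normal(0, 0.1, (4 * n_hidden, n_freq)),   # W: input weights
          rng.normal(0, 0.1, (4 * n_hidden, n_hidden)), # U: recurrent weights
          np.zeros(4 * n_hidden),                       # b: gate biases
          rng.normal(0, 0.1, (n_sem, n_hidden)),        # V: semantic readout
          np.zeros(n_sem))                              # d: readout bias
sem = earshot_like_forward(rng.normal(size=(n_frames, n_freq)), params)
```

Because an output is produced at every frame, the activation of a word's semantic pattern can be tracked over time, which is what allows comparison with the human timecourse of lexical access and phonological competition.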

Identifiers

pubmed: 32274861
doi: 10.1111/cogs.12823

Publication types

Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

e12823

Copyright information

© 2020 Cognitive Science Society, Inc.

References

Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the timecourse of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38, 419-439.
Cole, R. A., & Jakimik, J. (1980). A model of speech perception. In R. A. Cole (Ed.), Perception and production of fluent speech (pp. 133-163). Hillsdale, NJ: Erlbaum.
DiCarlo, J. J., & Cox, D. D. (2007). Untangling invariant object recognition. Trends in Cognitive Sciences, 11, 333-341.
Fowler, C. A., & Housum, J. (1987). Talkers' signaling of “new” and “old” words in speech and listeners' perception and use of the distinction. Journal of Memory and Language, 26, 489-504.
Grossberg, S., Boardman, I., & Cohen, M. (1997). Neural dynamics of variable-rate speech categorization. Journal of Experimental Psychology: Human Perception and Performance, 23, 481-503.
Hannagan, T., Magnuson, J. S., & Grainger, J. (2013). Spoken word recognition without a TRACE. Frontiers in Psychology, 4, 563. https://doi.org/10.3389/fpsyg.2013.00563
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29, 82-97.
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In S. C. Kremer & J. F. Kolen (Eds.), A field guide to dynamical recurrent neural networks (pp. 237-374). New York: IEEE Press.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-1780.
Joos, M. (1948). Acoustic phonetics. Baltimore, MD: Linguistic Society of America.
Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V., & McDermott, J. H. (2018). A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98, 630-644.
Kietzmann, T. C., McClure, P., & Kriegeskorte, N. (2019). Deep neural networks in computational neuroscience. In S. Murray Sherman (Ed.), Oxford research encyclopedia of neuroscience. Oxford University Press. https://doi.org/10.1093/acrefore/9780190264086.013.46
Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis: Connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 1-28.
Laszlo, S., & Plaut, D. C. (2012). A neurally plausible parallel distributed processing model of event-related potential reading data. Brain and Language, 120, 271-281.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, A. M., Delattre, P. C., & Cooper, F. S. (1952). The role of selected stimulus variables in the perception of the unvoiced-stop consonants. American Journal of Psychology, 65, 497-516.
Magnuson, J. S. (2008). Nondeterminism, pleiotropy, and single word reading: Theoretical and practical concerns. In E. Grigorenko & A. Naples (Eds.), Single word reading (pp. 377-404). Hillsdale, NJ: Erlbaum.
Magnuson, J. S. (2019). Very simple TRACE schematic (Version 1). figshare, https://doi.org/10.6084/m9.figshare.8273261.v1
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343, 1006-1010.
Miller, J. L., & Baer, T. (1983). Some effects of speaking rate on the production of /b/ and /w/. Journal of the Acoustical Society of America, 73, 1751-1755.
Morcos, A. S., Barrett, D. G. T., Rabinowitz, N. C., & Botvinick, M. (2018). On the importance of single directions for generalization. arXiv:1803.06959v4.
Nagamine, T., Seltzer, M. L., & Mesgarani, N. (2015). Exploring how deep neural networks form phonemic categories. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2015-January, (pp. 1912-1916).
Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115, 357-395.
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of vowels. Journal of the Acoustical Society of America, 24, 175-184.
Scharenborg, O. (2010). Modeling the use of durational information in human spoken-word recognition. Journal of the Acoustical Society of America, 127, 3758-3770.
Scharenborg, O., Norris, D., ten Bosch, L., & McQueen, J. M. (2005). How should a speech recognizer work? Cognitive Science, 29, 867-918.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv:1706.03762v5 [cs.CL].
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339-356.
You, H., & Magnuson, J. S. (2018). TISK 1.0: An easy-to-use Python implementation of the time-invariant string kernel model of spoken word recognition. Behavior Research Methods, 50(3), 871-889. https://doi.org/10.3758/s13428-017-1012-5

Authors

James S Magnuson (JS)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Psychological Sciences, University of Connecticut.

Heejo You (H)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Psychological Sciences, University of Connecticut.

Sahil Luthra (S)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Psychological Sciences, University of Connecticut.

Monica Li (M)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Psychological Sciences, University of Connecticut.
Haskins Laboratories.

Hosung Nam (H)

Haskins Laboratories.
Department of English Language and Literature, Korea University.

Monty Escabí (M)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Psychological Sciences, University of Connecticut.
Electrical and Computer Engineering, University of Connecticut.
Biomedical Engineering, University of Connecticut.

Kevin Brown (K)

Departments of Pharmaceutical Sciences and Chemical, Biological, and Environmental Engineering, Oregon State University.

Paul D Allopenna (PD)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Psychological Sciences, University of Connecticut.

Rachel M Theodore (RM)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Speech, Language, and Hearing Sciences, University of Connecticut.

Nicholas Monto (N)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Speech, Language, and Hearing Sciences, University of Connecticut.

Jay G Rueckl (JG)

Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut.
Psychological Sciences, University of Connecticut.
Haskins Laboratories.
