SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan.
Catalan language
Contextual diversity
Subtitles
Word frequency
Journal
Behavior research methods
ISSN: 1554-3528
Abbreviated title: Behav Res Methods
Country: United States
NLM ID: 101244316
Publication information
Publication date:
02 2020
History:
pubmed: 22 3 2019
medline: 12 8 2020
entrez: 22 3 2019
Status:
ppublish
Abstract
SUBTLEX-CAT is a word frequency and contextual diversity database for Catalan, obtained from a 278-million-word corpus based on subtitles supplied from broadcast Catalan television. Like all previous SUBTLEX corpora, it comprises subtitles from films and TV series. In addition, it includes a wider range of TV shows (e.g., news, documentaries, debates, and talk shows) than has been included in most previous databases. Frequency metrics were obtained for the whole corpus, on the one hand, and only for films and fiction TV series, on the other. Two lexical decision experiments revealed that the subtitle-based metrics outperformed the previously available frequency estimates, computed from either written texts or texts from the Internet. Furthermore, the metrics obtained from the whole corpus were better predictors than the ones obtained from films and fiction TV series alone. In both experiments, the best predictor of response times and accuracy was contextual diversity.
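The two kinds of metric named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: word frequency as occurrences per million tokens, and contextual diversity as the percentage of documents (here, subtitle files) in which a word appears at least once. The function name `frequency_and_cd`, the regex tokenizer, and the toy sentences are hypothetical and are not taken from the SUBTLEX-CAT pipeline; the database's actual tokenization and scaling may differ.

```python
import re
from collections import Counter


def frequency_and_cd(documents):
    """Map each word to (frequency per million tokens, % of documents containing it).

    Hypothetical helper for illustration only; not the authors' procedure.
    """
    token_counts = Counter()  # total occurrences of each word across all documents
    doc_counts = Counter()    # number of documents in which each word occurs
    for text in documents:
        # Simplistic tokenizer (assumption): lowercase runs of Unicode letters.
        # It splits forms such as "l'home" or "col·legi" at the punctuation mark.
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        token_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per document
    total_tokens = sum(token_counts.values())
    n_docs = len(documents)
    return {
        word: (
            count * 1_000_000 / total_tokens,  # frequency per million tokens
            100 * doc_counts[word] / n_docs,   # contextual diversity in percent
        )
        for word, count in token_counts.items()
    }


if __name__ == "__main__":
    # Toy corpus: three very short "subtitle files".
    toy_corpus = ["El gat dorm.", "El gos corre i el gat mira.", "Bon dia"]
    for word, (per_million, cd) in sorted(frequency_and_cd(toy_corpus).items()):
        print(f"{word}\t{per_million:.1f}\t{cd:.1f}")
```

In this sketch a word repeated many times within one file gains frequency but not contextual diversity, which is the distinction the two metrics are meant to capture.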
Identifiers
pubmed: 30895456
doi: 10.3758/s13428-019-01233-1
pii: 10.3758/s13428-019-01233-1
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
360-375