The Children's Picture Books Lexicon (CPB-LEX): A large-scale lexical database from children's picture books.

Age of acquisition; Child input norms; Early print exposure; Lexical database; Picture books

Journal

Behavior Research Methods
ISSN: 1554-3528
Abbreviated title: Behav Res Methods
Country: United States
NLM ID: 101244316

Publication information

Publication date:
11 Aug 2023
History:
accepted: 10 07 2023
medline: 11 8 2023
pubmed: 11 8 2023
entrez: 11 8 2023
Status: aheadofprint

Abstract

This article presents CPB-LEX, a large-scale database of lexical statistics derived from children's picture books (age range 0-8 years). Such a database is essential for research in psychology, education and computational modelling, where rich details on the vocabulary of early print exposure are required. CPB-LEX was built through an innovative method of computationally extracting lexical information from automatic speech-to-text captions and subtitle tracks generated from social media channels dedicated to reading picture books aloud. It consists of approximately 25,585 types (wordforms) and their frequency norms (raw and Zipf-transformed), a lexicon of bigrams (two-word sequences and their transitional probabilities) and a document-term matrix (which shows the importance of each word in the corpus in each book). Several immediate contributions of CPB-LEX to behavioural science research are reported, including that the new CPB-LEX frequency norms strongly predict age of acquisition and outperform comparable child-input lexical databases. The database allows researchers and practitioners to extract lexical statistics for high-frequency words which can be used to develop word lists. The paper concludes with an investigation of how CPB-LEX can be used to extend recent modelling research on the lexical diversity children receive from picture books in addition to child-directed speech. Our model shows that the vocabulary input from a relatively small number of picture books can dramatically enrich vocabulary exposure from child-directed speech and potentially assist children with vocabulary input deficits. The database is freely available from the Open Science Framework repository: https://tinyurl.com/4este73c
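
The two word-level statistics named in the abstract, Zipf-transformed frequency and bigram transitional probability, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the toy token list are hypothetical, and the Zipf value is assumed to be the common unsmoothed form, log10(frequency per million tokens) + 3. The transitional probability is taken as count(w1, w2) divided by the number of times w1 occurs as the first word of a bigram.

```python
import math
from collections import Counter


def zipf_scores(tokens):
    """Zipf value per wordform: log10(frequency per million tokens) + 3.

    Assumption: unsmoothed variant; published norms may apply smoothing.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {w: math.log10(c / total * 1_000_000) + 3 for w, c in counts.items()}


def transitional_probabilities(tokens):
    """Forward transitional probability P(w2 | w1) for each observed bigram."""
    # Count each word only where it can start a bigram (all but the last token).
    first_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): c / first_counts[w1]
        for (w1, w2), c in bigram_counts.items()
    }


# Hypothetical toy "book" standing in for caption-derived text.
toy_tokens = "the cat sat on the mat the cat ran".split()
zipf = zipf_scores(toy_tokens)
tp = transitional_probabilities(toy_tokens)
```

On the toy input, "the" is followed by "cat" in two of its three occurrences, so `tp[("the", "cat")]` is 2/3. With a real corpus the Zipf values typically fall between roughly 1 and 7; the inflated values on a nine-token list are an artefact of its size.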

Identifiers

pubmed: 37566336
doi: 10.3758/s13428-023-02198-y
pii: 10.3758/s13428-023-02198-y

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© 2023. Crown.

Authors

Clarence Green (C)

Faculty of Education, University of Hong Kong, Pok Fu Lam, Hong Kong.

Kathleen Keogh (K)

Senior Lecturer, Centre for Smart Analytics & Institute of Innovation, Science and Sustainability, Federation University Australia, Mount Helen, Australia. k.keogh@federation.edu.au.

He Sun (H)

Centre for Research in Child Language, National Institute of Education, Nanyang Technological University, Singapore, Singapore.

Beth O'Brien (B)

Centre for Research in Child Development, National Institute of Education, Nanyang Technological University, Singapore, Singapore.

MeSH classifications