A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics.

Jensen–Shannon divergence; Project Gutenberg; natural language processing; quantitative linguistics; reproducibility

Journal

Entropy (Basel, Switzerland)
ISSN: 1099-4300
Abbreviated title: Entropy (Basel)
Country: Switzerland
NLM ID: 101243874

Publication information

Publication date:
20 Jan 2020
History:
received: 29 11 2019
revised: 15 01 2020
accepted: 16 01 2020
entrez: 8 12 2020
pubmed: 9 12 2020
medline: 9 12 2020
Status: epublish

Abstract

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
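
The three levels of granularity named in the abstract (raw text, timeseries of word tokens, counts of words) can be illustrated with a minimal Python sketch. This is not the authors' published code: the function names and the regular-expression tokenizer are assumptions made for illustration, and the actual SPGC pre-processing may differ. The Jensen–Shannon divergence, which appears among this record's keywords, is included as one plausible way to compare the word-frequency distributions of two texts.

```python
import math
import re
from collections import Counter


def tokenize(raw_text):
    """Level 2: turn raw text into an ordered list of word tokens.

    Assumption: tokens are lowercase runs of ASCII letters. This is a
    simplification and not the tokenizer used to build the corpus.
    """
    return re.findall(r"[a-z]+", raw_text.lower())


def word_counts(tokens):
    """Level 3: collapse the token sequence into word counts."""
    return Counter(tokens)


def jensen_shannon_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence (in bits) between two count dictionaries.

    Counts are normalized to relative frequencies; the result lies in [0, 1].
    """
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    jsd = 0.0
    for word in set(counts_p) | set(counts_q):
        p = counts_p.get(word, 0) / total_p
        q = counts_q.get(word, 0) / total_q
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


# Level 1: raw text (two toy "books" standing in for PG files)
book_a = "The cat sat. The cat!"
book_b = "A dog ran."

counts_a = word_counts(tokenize(book_a))
counts_b = word_counts(tokenize(book_b))
divergence = jensen_shannon_divergence(counts_a, counts_b)
```

Because the two toy texts share no words, the divergence reaches its maximum of 1 bit; comparing a text with itself gives 0.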

Identifiers

pubmed: 33285901
pii: e22010126
doi: 10.3390/e22010126
pmc: PMC7516435

Publication types

Journal Article

Languages

eng

Authors

Martin Gerlach (M)

Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA.

Francesc Font-Clos (F)

Center for Complexity and Biosystems, Department of Physics, University of Milan, 20133 Milano, Italy.

MeSH classifications