Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity.

Experimental survey; HESML; Information content models; Ontology-based semantic similarity measures; Reprozip; Word embedding models; WordNet

Journal

Data in brief
ISSN: 2352-3409
Abbreviated title: Data Brief
Country: Netherlands
NLM ID: 101654995

Publication information

Publication date:
Oct 2019
History:
received: 28 07 2019
revised: 11 08 2019
accepted: 16 08 2019
entrez: 14 9 2019
pubmed: 14 9 2019
medline: 14 9 2019
Status: epublish

Abstract

This data article introduces a reproducibility dataset that allows the exact replication of all experiments, results and data tables reported in our companion paper (Lastra-Díaz et al., 2019), which presents the largest experimental survey on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity reported in the literature. All our experiments, and the gathering of the raw data derived from them, were based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and were subsequently recorded with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in each benchmark. The raw data files were processed by running an R-language script that computes all evaluation metrics reported in (Lastra-Díaz et al., 2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform that allows new word similarity benchmarks to be run by setting up new experiments that include methods or word similarity benchmarks not considered in the survey.
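As an illustration of the evaluation metrics named in the abstract, the following minimal Python sketch computes Pearson and Spearman correlation and a harmonic score for one method on one benchmark. It is not the R-language script distributed with the dataset. The function names and the example values are hypothetical, and the harmonic score is assumed here to be the harmonic mean of the two correlation values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_score(r, rho):
    """Assumed definition: harmonic mean of Pearson (r) and Spearman (rho)."""
    return 2 * r * rho / (r + rho)


# Hypothetical values: human judgements and one method's raw similarity
# scores for the same five word pairs of a benchmark.
human = [3.92, 3.84, 0.42, 1.18, 2.97]
method = [0.91, 0.78, 0.12, 0.30, 0.65]

r = pearson(human, method)
rho = spearman(human, method)
print(r, rho, harmonic_score(r, rho))
```

In the dataset itself, the input to such a computation would be one raw data file per benchmark, with one column of similarity values per evaluated method.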

Identifiers

pubmed: 31516953
doi: 10.1016/j.dib.2019.104432
pii: S2352-3409(19)30787-5
pii: 104432
pmc: PMC6736772

Publication types

Journal Article

Languages

eng

Pagination

104432

References

Data Brief. 2019 Aug 26;26:104432
pubmed: 31516953

Authors

Juan J Lastra-Díaz (JJ)

NLP & IR Research Group, ETSI de Informática (UNED), Universidad Nacional de Educación a Distancia, Juan Del Rosal 16, 28040, Madrid, Spain.

Josu Goikoetxea (J)

IXA NLP Group, Faculty of Informatics, UPV/EHU, Manuel Lardizabal 1, 20018, Donostia, Basque Country, Spain.

Mohamed Ali Hadj Taieb (MA)

Faculty of Sciences of Sfax, Tunisia.

Ana García-Serrano (A)

NLP & IR Research Group, ETSI de Informática (UNED), Universidad Nacional de Educación a Distancia, Juan Del Rosal 16, 28040, Madrid, Spain.

Mohamed Ben Aouicha (MB)

Faculty of Sciences of Sfax, Tunisia.

Eneko Agirre (E)

IXA NLP Group, Faculty of Informatics, UPV/EHU, Manuel Lardizabal 1, 20018, Donostia, Basque Country, Spain.

MeSH classifications