Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity

Juan J. Lastra-Díaz; Josu Goikoetxea; Mohamed Ali Hadj Taieb; Ana García-Serrano; Mohamed Ben Aouicha; Eneko Agirre

Abstract

This data article introduces a reproducibility dataset that allows the exact replication of all experiments, results and data tables presented in our companion paper (Lastra-Díaz et al., 2019), which reports the largest experimental survey in the literature on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in any of the benchmarks. The raw data files were processed by running an R-language script that computes all evaluation metrics reported in Lastra-Díaz et al. (2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and that automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform for running new word similarity benchmarks by setting up new experiments that include methods or word similarity benchmarks not considered here.
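As a rough illustration of the evaluation metrics named in the abstract, the following Python sketch computes the Pearson and Spearman correlations between human similarity judgements and the scores returned by a method, plus a harmonic score. It is not the authors' R script: the function names are hypothetical, the word-pair values are made-up toy numbers, and the harmonic score is assumed here to be the harmonic mean of the two correlations. Statistical significance p-values are left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_score(r, rho):
    """Assumed definition: harmonic mean of Pearson (r) and Spearman (rho)."""
    return 2 * r * rho / (r + rho)


# Toy values (not taken from the dataset): one entry per word pair.
human_judgements = [3.9, 3.5, 1.2, 0.4, 2.8]
method_scores = [0.95, 0.80, 0.30, 0.10, 0.55]

r = pearson(human_judgements, method_scores)
rho = spearman(human_judgements, method_scores)
h = harmonic_score(r, rho)
print(f"Pearson={r:.3f} Spearman={rho:.3f} harmonic={h:.3f}")
```

Computing Spearman as the Pearson correlation of average ranks keeps the sketch free of external dependencies and handles tied scores, which are common when a similarity method returns the same value for several word pairs.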

Bibliographic record

Source
Data in Brief, 2019, Issue 1, 9 pages
Authors
Juan J. Lastra-Díaz; Josu Goikoetxea; Mohamed Ali Hadj Taieb; Ana García-Serrano; Mohamed Ben Aouicha; Eneko Agirre
Original format
PDF
Chinese Library Classification
Computing technology, computer technology
Keywords
Ontology-based semantic similarity measures; Word embedding models; Information content models; WordNet; Experimental survey; HESML; Reprozip

Similar literature

1. A reproducible survey on word embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art [J]. Lastra-Diaz Juan J., Goikoetxea Josu, Taieb Mohamed Ali Hadj. Engineering Applications of Artificial Intelligence, 2019 (Oct.)
2. A large reproducible benchmark of ontology-based methods and word embeddings for word similarity [J]. Lastra-Diaz Juan J., Goikoetxea Josu, Taieb Mohamed Ali Hadj. Information Systems, 2021 (Feb.)
3. A Multidisciplinary Method for Constructing and Validating Word Similarity Datasets [C]. Yu Wan, Yidong Chen, Xiaodong Shi. UK Workshop on Computational Intelligence, 2018
4. Improved GloVe Word Embedding Using Linear Weighting Scheme for Word Similarity Tasks [D]. Lu, Qinglan. 2021
5. Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity [O]. Juan J. Lastra-Díaz, Josu Goikoetxea, Mohamed Ali Hadj Taieb. 2019
