Data in Brief

Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity


Abstract

This data article introduces a reproducibility dataset that allows the exact replication of all experiments, results and data tables presented in our companion paper (Lastra-Díaz et al., 2019), which reports the largest experimental survey in the literature on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in each benchmark. The raw data files were processed by running an R-language script that computes all evaluation metrics reported in (Lastra-Díaz et al., 2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and that automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform that allows users to run new word similarity benchmarks by setting up new experiments that include methods or word similarity benchmarks not considered in the survey.
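
As an illustration of the evaluation metrics named above, the following is a minimal Python sketch, not the R script distributed with the dataset. It assumes that the harmonic score is the harmonic mean of the Pearson (r) and Spearman (rho) correlations, h = 2·r·rho / (r + rho); that definition, the function names and the sample values are assumptions made for this example and are not taken from the abstract. The computation of p-values is omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_score(r, rho):
    """Harmonic mean of r and rho (assumed definition, see lead-in)."""
    return 2 * r * rho / (r + rho)


# Hypothetical values: human judgements for five word pairs and the
# raw similarity values that one method returned for the same pairs.
human = [3.9, 3.5, 0.4, 1.2, 2.8]
method = [0.9, 0.7, 0.1, 0.3, 0.8]

r = pearson(human, method)
rho = spearman(human, method)
h = harmonic_score(r, rho)
```

In this sketch each benchmark would contribute one pair of vectors (human judgements and method output), and the three numbers r, rho and h would form one cell group of a results table.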
