Journal: PLoS One

Protocol for a reproducible experimental survey on biomedical sentence similarity


Abstract

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, sentence similarity methods for the biomedical domain have attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for reasons that include the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly set. The literature on biomedical sentence similarity has other significant gaps as well: (1) several unexplored sentence similarity methods that deserve study have not been evaluated; (2) an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR), has not been evaluated; (3) the impact of the pre-processing stage and of Named Entity Recognition (NER) tools on the performance of sentence similarity methods has not been studied; and (4) software and data resources for reproducing methods and experiments in this line of research are lacking. To address these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, most up-to-date and, for the first time, reproducible experimental survey on biomedical sentence similarity.
This experimental survey will be based on our own software replication and on the evaluation of all the methods under study on a single software platform, which will be developed specifically for this work and will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
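
To make gap (3) concrete, here is a minimal sketch of how pre-processing choices can change a sentence similarity score. It is not one of the methods or tools evaluated in the survey. It assumes a simple token-set Jaccard measure, a hypothetical stop-word list (`STOP_WORDS`), and two made-up example sentences, all chosen only for illustration.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "in", "is", "by", "and"}


def tokenize(sentence, lowercase=True, remove_stop_words=True):
    """Return the set of alphanumeric tokens, with optional pre-processing."""
    if lowercase:
        sentence = sentence.lower()
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return set(tokens)


def jaccard_similarity(s1, s2, **preprocessing):
    """Jaccard overlap of the two token sets: |A & B| / |A | B|."""
    a = tokenize(s1, **preprocessing)
    b = tokenize(s2, **preprocessing)
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    return len(a & b) / len(a | b)


# Made-up sentence pair, not taken from any benchmark.
s1 = "The expression of GerE is activated by SigK"
s2 = "SigK activates the expression of gerE"

raw = jaccard_similarity(s1, s2, lowercase=False, remove_stop_words=False)
clean = jaccard_similarity(s1, s2)
print(f"no pre-processing:   {raw:.3f}")
print(f"with pre-processing: {clean:.3f}")
```

For this pair, the same measure gives a noticeably higher score once the text is lowercased and stop words are removed. This illustrates why a fixed, documented pre-processing configuration matters when comparing methods on the same platform.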