Journal of Biomedical Discovery and Collaboration

Corpus Refactoring: a Feasibility Study


Abstract

Background: Most biomedical corpora have not been used outside of the lab that created them, even though the availability of the gold-standard evaluation data they provide is one of the rate-limiting factors for progress in biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring (changing the format of a corpus without altering its semantics) is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.

Results: The refactored corpus is available for download at the BioNLP SourceForge website, http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.

Conclusion: We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.
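To make the idea of a format change that leaves the semantics untouched more concrete, here is a minimal sketch of one such conversion: turning stand-off character-offset annotations into embedded XML. It is not the authors' pipeline. The abstract does not describe the Protein Design Group corpus's original layout or the target XML schema, so the function name `embed_annotations`, the `(start, end, label)` span tuples, and the `sentence` and `entity` tag names are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def embed_annotations(text, spans):
    """Wrap annotated character spans of `text` in XML tags.

    `spans` is a list of (start, end, label) tuples with non-overlapping
    character offsets. This layout is hypothetical, not the PDG format.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor or start >= end or end > len(text):
            raise ValueError(f"invalid or overlapping span: {(start, end, label)}")
        # Unannotated text between the previous span and this one.
        pieces.append(escape(text[cursor:start]))
        # The annotated span, wrapped in a (hypothetical) <entity> element.
        pieces.append(
            f"<entity type={quoteattr(label)}>{escape(text[start:end])}</entity>"
        )
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "<sentence>" + "".join(pieces) + "</sentence>"


def plain_text(xml_string):
    """Recover the underlying text, to check that the conversion kept it intact."""
    return "".join(ET.fromstring(xml_string).itertext())


text = "p53 binds MDM2 & more"
spans = [(10, 14, "protein"), (0, 3, "protein")]
xml_out = embed_annotations(text, spans)
print(xml_out)
```

Comparing `plain_text(xml_out)` with the original `text` is a cheap automated check of the kind that could sit alongside manual validation: if the two differ, the conversion altered the corpus text rather than only its format.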
