Journal: Database

Web services-based text-mining demonstrates broad impacts for interoperability and process simplification


Abstract

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop comprised five independent subject areas, among them Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details: rather than integrating NER tools directly, CTD's asynchronous, batch-oriented text-mining pipeline could make HTTP-based calls to remote NER Web services for recognition of specific biological terms, using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer (REST)/BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61%, 74% and 51%, respectively. Response times ranged from fractions of a second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation.

Database URL: http://ctdbase.org/
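The abstract reports recall, precision and balanced F-scores but does not say how NER output was matched against the curated articles. As a minimal sketch, assuming a simple set-based comparison of term identifiers for a single article, the three measures could be computed as follows. The function name `ner_scores` and the identifiers in the sample data are hypothetical and are not taken from the challenge.

```python
def ner_scores(predicted, curated):
    """Return (precision, recall, balanced F-score) for two collections of terms.

    Assumption: a prediction counts as correct only if the same identifier
    appears in the curated set. The challenge's real matching rules are not
    described in the abstract.
    """
    predicted, curated = set(predicted), set(curated)
    true_positives = len(predicted & curated)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical identifiers, for illustration only.
curated = {"TERM:0001", "TERM:0002", "TERM:0003", "TERM:0004"}
predicted = {"TERM:0001", "TERM:0002", "TERM:0099"}

precision, recall, f_score = ner_scores(predicted, curated)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
```

With this sample data, two of the three predictions are correct and two of the four curated terms are found, so precision exceeds recall and the balanced F-score falls between the two.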
