Journal of Data and Information Science

Data-driven Discovery: A New Era of Exploiting the Literature and Data

Abstract

In the current data-intensive era, the traditional hands-on method of conducting scientific research, exploring related publications to generate a testable hypothesis, is well on its way to becoming obsolete within just a year or two. Analyzing the literature and data to automatically generate a hypothesis might become the de facto approach informing the core research efforts of those trying to master the exponentially rapid expansion of publications and datasets. Here, viewpoints are provided and discussed to aid understanding of the challenges of data-driven discovery.

The Panama Canal, the 77-kilometer waterway connecting the Atlantic and Pacific oceans, has played a crucial role in international trade for more than a century. Digging the canal, however, was an exceedingly challenging process. A French effort in the late 19th century was abandoned because of equipment problems and a significant loss of labor to tropical diseases transmitted by mosquitoes. The United States officially took control of the project in 1902 and replaced the unusable French equipment with new construction machinery designed for a much larger and faster scale of work. Colonel William C. Gorgas was appointed chief sanitation officer and charged with eliminating mosquito-spread illnesses. After overcoming these and additional trials and tribulations, the canal successfully opened on August 15, 1914. Its triumphant completion demonstrates that using the right tools and eliminating significant threats are critical steps in any project.

More than 100 years later, a paradigm shift is occurring as we move into a data-centered era. Today, data are extremely rich but overwhelming, and extracting information from them requires not only the right tools and methods but also awareness of major threats. In this data-intensive era, the traditional method of exploring related publications and available datasets from previous experiments to arrive at a testable hypothesis is becoming obsolete. Consider that a new article is published every 30 seconds (Jinha, 2010). For the common disease of diabetes alone, roughly 500,000 articles have been published to date; even a scientist reading 20 papers per day would need 68 years to wade through the material. The standard method simply cannot cope with this volume of documents or with the exponential growth of datasets. A major threat is that the canon of domain knowledge can no longer be consumed and held in human memory. Without efficient methods to process information, and without a way to eliminate the fundamental threats of limited memory and limited time in the face of the data deluge, we may find ourselves facing failure just as the French did on the Isthmus of Panama more than a century ago.
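The reading-time arithmetic above is easy to verify. A minimal back-of-envelope sketch in Python, using only the figures quoted in the text:

```python
# Back-of-envelope check of the abstract's reading-time figures.
ARTICLES = 500_000       # diabetes articles published to date (from the text)
PAPERS_PER_DAY = 20      # an unusually diligent reading pace
DAYS_PER_YEAR = 365.25

days = ARTICLES / PAPERS_PER_DAY   # 25,000 days of uninterrupted reading
years = days / DAYS_PER_YEAR       # about 68 years, as the text states
print(f"{days:,.0f} days = {years:.1f} years")

# The one-new-article-every-30-seconds rate (Jinha, 2010) implies:
per_year = (60 / 30) * 60 * 24 * 365   # articles/min * min/h * h/day * days/yr
print(f"~{per_year:,.0f} new articles per year")
```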
Scouring the literature and data to generate a hypothesis might become the de facto approach informing the core research efforts of those trying to master the exponentially rapid expansion of publications and datasets (Evans & Foster, 2011). In reality, most scholars have never been able to keep completely up to date with publications and datasets, given the unending increase in the quantity and diversity of research within their own areas of focus, let alone in related conceptual areas where knowledge may be segregated behind syntactically impenetrable keyword barriers or an entirely different research corpus.

Research communities in many disciplines are finally recognizing that, with advances in information technology, new ways are needed to extract entities from increasingly data-intensive publications and to integrate and analyze large-scale datasets. This presents a compelling opportunity to improve knowledge discovery from the literature and datasets through knowledge graphs and an associated framework, one that integrates scholars, domain knowledge, datasets, workflows, and machines on a scale previously beyond our reach (Ding et al., 2013).
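To make the knowledge-graph idea concrete, the sketch below models a scholarly graph as subject-predicate-object triples linking scholars, papers, datasets, and workflows. All entity names are hypothetical illustrations, not part of the framework in Ding et al. (2013):

```python
# A minimal, hypothetical sketch of a scholarly knowledge graph as
# subject-predicate-object triples; all entity names are illustrative.
triples = [
    ("scholar:alice", "authored", "paper:diabetes-review"),
    ("paper:diabetes-review", "cites", "paper:glucose-study"),
    ("paper:glucose-study", "uses", "dataset:cohort-a"),
    ("workflow:risk-model", "consumes", "dataset:cohort-a"),
]

def neighbors(entity: str) -> set[str]:
    """Entities one hop away from `entity`, in either direction."""
    out = {o for s, _, o in triples if s == entity}
    inc = {s for s, _, o in triples if o == entity}
    return out | inc

# A hypothesis-generation step might traverse such links, e.g. to find
# every paper and workflow that touches a given dataset:
print(neighbors("dataset:cohort-a"))   # {'paper:glucose-study', 'workflow:risk-model'}
```

Production systems would store such triples in an RDF store with provenance attached, but even this toy form shows how links between the literature and data become machine-traversable.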