BMC Systems Biology

A framework for biomedical figure segmentation towards image-based document retrieval

Abstract

The figures included in many biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that information extracted from figures can be integrated into classical document classification and retrieval tasks to improve their accuracy. One important observation about the figures in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, to make better use of multimodal figures in document classification or retrieval tasks, and to provide the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) in multiple layouts. In addition, certain types of biomedical figures are text-heavy (e.g., images of DNA sequences and protein sequences) and differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, cannot be applied directly to robustly partition multimodal figures into single-modal panels.

In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking the document layout into consideration allows us to correctly extract figures from the PDF document and associate them with their corresponding captions. We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures.

To evaluate the approach described here, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving an accuracy of 96.64%. To allow for efficient retrieval of information, and to provide the basis for integration into document classification and retrieval systems, among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in their queries.
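
The idea of estimating the number of panels from both the caption and the pixels can be pictured with a small sketch. This is not the authors' implementation, which the abstract does not detail. The function names, the parenthesized-letter label pattern such as "(A)", and the whitespace-gutter heuristic are all assumptions chosen for illustration. The image is assumed to be a grayscale grid given as a list of rows with a pure white (255) background, and the sketch only reports the two estimates and whether they agree, since the abstract does not state how the two sources are reconciled.

```python
import re

# Assumption: panels are labeled in the caption as a single letter in
# parentheses, e.g. "(A)" or "(b)". Real captions use many other styles.
PANEL_LABEL = re.compile(r"\(([A-Za-z])\)")


def count_caption_panels(caption):
    """Count distinct single-letter panel labels found in a caption."""
    return len({label.upper() for label in PANEL_LABEL.findall(caption)})


def blank_runs(flags, min_width):
    """Return (start, end) pairs for interior runs of True in `flags`.

    Runs that touch the first or last index are treated as outer margins
    rather than gutters between panels, so they are not returned.
    """
    runs = []
    start = None
    for i, blank in enumerate(flags):
        if blank and start is None:
            start = i
        elif not blank and start is not None:
            if start > 0 and i - start >= min_width:
                runs.append((start, i))
            start = None
    # A run still open here reaches the last index: a margin, so it is dropped.
    return runs


def estimate_grid(image, background=255, min_gutter=2):
    """Estimate (rows, columns) of panels from fully blank rows and columns."""
    rows_blank = [all(p == background for p in row) for row in image]
    cols_blank = [
        all(row[x] == background for row in image)
        for x in range(len(image[0]))
    ]
    n_rows = len(blank_runs(rows_blank, min_gutter)) + 1
    n_cols = len(blank_runs(cols_blank, min_gutter)) + 1
    return n_rows, n_cols


def compare_estimates(image, caption):
    """Report the image-based and caption-based panel counts side by side."""
    n_rows, n_cols = estimate_grid(image)
    from_image = n_rows * n_cols
    from_caption = count_caption_panels(caption)
    return {
        "image": from_image,
        "caption": from_caption,
        "agree": from_image == from_caption,
    }
```

A gutter heuristic like this fails on exactly the cases the abstract mentions, such as text-heavy sequence images or panels without clear white separators, which is why the caption count is kept as a second, independent signal.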