Database: The Journal of Biological Databases and Curation

Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

Abstract

The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource that aims to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed to support the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools and employs feature selection, to assist curators in identifying publications relevant to GXD. We examine the performance of our method on a large manually curated dataset consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD and the other half as irrelevant. In addition to text from the title and abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification task. Moreover, using information obtained from image captions clearly improves performance compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area.

Database URL:
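The abstract names neither the classifier nor the feature selection criterion, so the following is only a minimal sketch of the general idea: merge title-and-abstract text with caption text, keep the most class-discriminative terms, and train a simple relevance classifier on them. Chi-square scoring and multinomial naive Bayes are stand-ins chosen for illustration, not the authors' method. The field names `abstract` and `captions`, the function names, and the four toy documents are all hypothetical.

```python
import math
import re
from collections import Counter


def doc_terms(doc):
    """Tokenize title-and-abstract text merged with caption text (hypothetical fields)."""
    text = doc["abstract"] + " " + doc.get("captions", "")
    return re.findall(r"[a-z]+", text.lower())


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 table (term present/absent vs. relevant/irrelevant)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return 0.0 if denom == 0 else n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score (ties broken alphabetically)."""
    pos = sum(labels)
    neg = len(labels) - pos
    df_pos, df_neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        (df_pos if y else df_neg).update(set(doc_terms(doc)))
    scores = {}
    for t in set(df_pos) | set(df_neg):
        a, b = df_pos[t], df_neg[t]
        scores[t] = chi_square(a, b, pos - a, neg - b)
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:k])


def train_nb(docs, labels, vocab):
    """Multinomial naive Bayes with add-one smoothing over the selected terms."""
    counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(t for t in doc_terms(doc) if t in vocab)
    priors = {c: math.log(labels.count(c) / len(labels)) for c in (0, 1)}
    loglik = {}
    for c in (0, 1):
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
    return priors, loglik


def predict(doc, model):
    """Return 1 (relevant) or 0 (irrelevant) for a single document."""
    priors, loglik = model
    scores = {
        c: priors[c] + sum(loglik[c][t] for t in doc_terms(doc) if t in loglik[c])
        for c in (0, 1)
    }
    return max(scores, key=scores.get)


# Toy, made-up training data: 1 = relevant to GXD, 0 = irrelevant.
train_docs = [
    {"abstract": "Expression of Pax6 in the developing mouse eye",
     "captions": "In situ hybridization of embryo sections"},
    {"abstract": "Wnt1 expression pattern during mouse embryo development",
     "captions": "Whole mount in situ hybridization at E10.5"},
    {"abstract": "Crystal structure of a bacterial enzyme",
     "captions": "Ribbon diagram of the protein structure"},
    {"abstract": "Clinical trial of a drug in human patients",
     "captions": "Survival curves for patient cohorts"},
]
train_labels = [1, 1, 0, 0]

vocab = select_features(train_docs, train_labels, k=6)
model = train_nb(train_docs, train_labels, vocab)
```

Calling `predict(doc, model)` on a new dictionary with the same two fields returns the predicted class. On a corpus the size described above, a library implementation of the same steps would be the more practical choice; the stdlib version here only makes each step visible.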