
Information Retrieval in Biomedical Research: From Articles to Datasets


Abstract

Information retrieval techniques have been applied to biomedical research for a variety of purposes, such as textual document retrieval and molecular data retrieval. As biomedical research evolves, information retrieval constantly faces new challenges, including the growing volume of available data, emerging data types, the demand for interoperability between data resources, and changes in users' search behavior. To help address these challenges, I studied three solutions in my dissertation: (a) using information collected from online resources to enrich the representation models for biomedical datasets; (b) exploring rule-based and deep learning-based methods to help users formulate effective queries for both dataset retrieval and publication retrieval; and (c) developing a "retrieval plus re-ranking" strategy to identify relevant datasets and rank them using customized ranking models.

In a biomedical dataset retrieval study, we developed a pipeline to automatically analyze users' free-text requests and rank relevant datasets using a "retrieval plus re-ranking" strategy. To improve the representation model of biomedical datasets, we explored online resources and collected information to enrich the metadata of datasets. The rule-based query formulation module extracted keywords from users' free-text requests, expanded the keywords using NCBI resources, and finally formulated Boolean queries using pre-designed templates. The novel "retrieval plus re-ranking" strategy captured relevant datasets in the retrieval step, and then ranked them using customized relevance scoring functions that model unique properties of the metadata of biomedical datasets. The solutions proved successful for biomedical dataset retrieval, and the pipeline achieved the highest inferred Normalized Discounted Cumulative Gain (infNDCG) score in the 2016 bioCADDIE Biomedical Dataset Retrieval Challenge.

In a biomedical publication retrieval study, we developed the eXtended PubMed Related Citation (XPRC) algorithm to find similar articles in PubMed. Currently, similar articles in PubMed are determined by the PubMed Related Citation (PRC) algorithm. However, when the distributions of term counts are similar between articles, the PRC algorithm may conclude that the articles are similar even though they may be about different topics. Conversely, when two articles discuss the same topic but use different terms, the PRC algorithm may miss the similarity. To address these problems, we implemented a term expansion method to help capture the similarity. Unlike popular ontology-based expansion methods, we used a deep learning method to learn distributed representations of terms from over one million articles in PubMed Central, and identified similar terms using the Euclidean distance between distributed representation vectors. We showed that, under certain conditions, XPRC can improve precision and helps find similar articles in PubMed.

In conclusion, information retrieval techniques in biomedical research have helped researchers find desired publications, datasets, and other information. Further research on robust representation models, intelligent query formulation systems, and effective ranking models will lead to smarter and more user-friendly information retrieval systems that will further promote the transformation from data to knowledge in biomedicine.
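The rule-based query formulation step described in the abstract (extract keywords, expand them, fill a Boolean template) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `SYNONYMS` table is a hypothetical stand-in for expansion through NCBI resources, and the "OR within a keyword group, AND across groups" template is one plausible template, not necessarily the one used in the dissertation.

```python
from typing import Dict, List

# Hypothetical synonym table; a stand-in for keyword expansion via NCBI resources.
SYNONYMS: Dict[str, List[str]] = {
    "breast cancer": ["breast neoplasms", "breast carcinoma"],
    "gene expression": ["transcription profiling"],
}


def expand_keyword(keyword: str) -> List[str]:
    """Return the keyword followed by its known synonyms, without duplicates."""
    terms = [keyword] + SYNONYMS.get(keyword.lower(), [])
    seen = set()
    unique = []
    for term in terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            unique.append(term)
    return unique


def build_boolean_query(keywords: List[str]) -> str:
    """Assumed template: OR inside each synonym group, AND between groups."""
    groups = []
    for keyword in keywords:
        quoted = ['"{}"'.format(term) for term in expand_keyword(keyword)]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


if __name__ == "__main__":
    print(build_boolean_query(["breast cancer", "mouse"]))
```

Keywords without an entry in the table pass through unchanged, so the query degrades gracefully when no expansion is available.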
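The "retrieval plus re-ranking" strategy can be sketched in a similar way: a permissive first pass collects candidate datasets, and a second pass orders them with a metadata-aware score. The abstract does not give the actual scoring functions, so the field names and weights below (`FIELD_WEIGHTS`) are hypothetical, and simple substring matching stands in for a real search engine.

```python
from typing import Dict, List

# Hypothetical metadata fields and weights; the dissertation's scoring
# functions are not specified in the abstract.
FIELD_WEIGHTS: Dict[str, float] = {"title": 2.0, "description": 1.0}


def retrieve(query_terms: List[str], datasets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Retrieval step: keep datasets in which any query term occurs in any field."""
    hits = []
    for ds in datasets:
        text = " ".join(ds.get(field, "") for field in FIELD_WEIGHTS).lower()
        if any(term.lower() in text for term in query_terms):
            hits.append(ds)
    return hits


def score(query_terms: List[str], ds: Dict[str, str]) -> float:
    """Re-ranking score: weighted count of query terms matched per field."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = ds.get(field, "").lower()
        total += weight * sum(1 for term in query_terms if term.lower() in text)
    return total


def retrieve_and_rerank(query_terms: List[str], datasets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Retrieve candidates, then sort them by descending relevance score."""
    candidates = retrieve(query_terms, datasets)
    return sorted(candidates, key=lambda ds: score(query_terms, ds), reverse=True)
```

Separating the two steps keeps the first pass cheap and recall-oriented, while the second pass can use a more expensive scoring function on a much smaller candidate set.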
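The term expansion idea behind XPRC, finding similar terms by the Euclidean distance between their distributed representation vectors, can be shown with a small sketch. The three-dimensional vectors in `VECTORS` are invented for illustration; in the dissertation the vectors are learned from PubMed Central articles, and the helper names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Toy vectors invented for illustration; real vectors would be learned from text.
VECTORS: Dict[str, List[float]] = {
    "tumor": [0.9, 0.1, 0.0],
    "neoplasm": [0.8, 0.2, 0.0],
    "cancer": [0.7, 0.1, 0.1],
    "kidney": [0.0, 0.9, 0.3],
}


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_terms(term: str, vectors: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the k terms whose vectors lie closest to the given term's vector."""
    target = vectors[term]
    distances = [
        (euclidean(target, vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    distances.sort()
    return [other for _, other in distances[:k]]
```

With these toy vectors, `similar_terms("tumor", VECTORS)` returns `["neoplasm", "cancer"]`, so an article that mentions "neoplasm" can still match one that mentions "tumor".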

Bibliographic record

  • Author: Wei, Wei
  • Author affiliation: University of California, San Diego
  • Degree-granting institution: University of California, San Diego
  • Subjects: Bioinformatics; Information science
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 139
  • Original format: PDF
  • Language of text: English
  • Date added to database: 2022-08-17 11:54:24
