
Hybrid query expansion on ontology graph in biomedical information retrieval.

Abstract

Biomedical researchers now publish thousands of papers and journals every day. Searching the biomedical literature to keep up with the state of the art is an increasingly difficult task for many individual researchers. The continuously growing amount of biomedical text data has created high demand for an efficient and effective biomedical information retrieval (BIR) system. Although many existing information retrieval techniques can be applied directly in BIR, BIR distinguishes itself by its extensive use of biomedical terms and abbreviations, which are highly ambiguous.

First of all, we studied a fundamental yet simpler problem: word semantic similarity. We proposed a novel semantic word similarity algorithm and related tools called Weighted Edge Similarity Tools (WEST). WEST was motivated by our discovery that humans are more sensitive to semantic differences due to categorization than to those due to generalization/specification. Unlike most existing methods, which model the semantic similarity of words based on either the depth of their Lowest Common Ancestor (LCA) or the traversal distance between the word pair in WordNet, WEST also considers the joint contribution of the weighted distance between two words and the weighted depth of their LCA in WordNet. Experiments show that the weighted-edge-based word similarity method achieves 83.5% accuracy against human judgments.

The query expansion problem can be viewed as selecting the top k words that have the maximum accumulated similarity to a given word set. It has proven to be an effective method in BIR and has been studied for over two decades. However, most previous research focuses on only one controlled vocabulary: MeSH. In addition, early studies found that applying an ontology does not necessarily improve search performance.
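The top-k formulation above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `expand_query`, the lookup table `TOY_SIMILARITY`, and all similarity values are hypothetical and chosen only for illustration. Any pairwise similarity measure (WEST, for instance) could be plugged in as the `similarity` argument.

```python
from typing import Callable, Iterable, List

# Hypothetical pairwise similarities, invented for this example only.
TOY_SIMILARITY = {
    ("heart", "cardiac"): 0.9,
    ("attack", "infarction"): 0.8,
    ("heart", "infarction"): 0.6,
    ("attack", "cardiac"): 0.3,
    ("heart", "kidney"): 0.2,
    ("attack", "kidney"): 0.1,
}


def toy_similarity(a: str, b: str) -> float:
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))


def expand_query(
    query_terms: Iterable[str],
    candidates: Iterable[str],
    similarity: Callable[[str, str], float],
    k: int,
) -> List[str]:
    """Return the k candidates with the largest summed similarity to the query."""
    query = list(query_terms)
    scores = {}
    for cand in candidates:
        if cand in query:
            continue  # terms already in the query are not expansion candidates
        scores[cand] = sum(similarity(cand, q) for q in query)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:k]


if __name__ == "__main__":
    print(expand_query(["heart", "attack"],
                       ["cardiac", "infarction", "kidney", "heart"],
                       toy_similarity, k=2))
```

With the toy table, "infarction" accumulates 0.6 + 0.8 and "cardiac" accumulates 0.9 + 0.3, so these two are selected ahead of "kidney".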
In this dissertation, we propose a novel graph-based query expansion approach that takes advantage of global information from multiple controlled vocabularies by building a biomedical ontology graph from selected vocabularies in the Metathesaurus. We apply the Personalized PageRank algorithm to the ontology graph to rank and identify the top terms that are highly relevant to the original user query yet not present in that query. Those new terms are reordered by a weighting scheme that prioritizes specialized concepts. We multiply the finally selected terms by a scaling factor to prevent query drift and append them to the original query in the search. Experiments show that our approach achieves a 17.7% improvement in 11-point average precision and recall over Lucene's default indexing and searching strategy, and is 24.8% better on average than all the other strategies. Furthermore, we observe that expanding with specialized concepts rather than generalized concepts can substantially improve recall-precision performance.

In addition, we have successfully carried WEST over from the underlying WordNet graph to the biomedical ontology graph constructed from multiple controlled vocabularies in the Metathesaurus. Experiments indicate that WEST further improves recall-precision performance.

Finally, we have developed a Graph-based Biomedical Search Engine (G-Bean) for retrieving and visualizing information from the literature using our proposed query expansion algorithm. G-Bean accepts any medical-related user query and processes it with an expanded medical query to search the MEDLINE database.
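The graph-based expansion step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's code: the tiny graph, the function names, the damping value 0.85, the iteration count, and the scaling factor 0.5 are all hypothetical. The reordering that prioritizes specialized concepts is omitted because the abstract does not specify it.

```python
from typing import Dict, List, Tuple

# Hypothetical undirected toy graph; every node appears as a key.
TOY_GRAPH: Dict[str, List[str]] = {
    "myocardial infarction": ["heart disease", "coronary thrombosis"],
    "heart disease": ["myocardial infarction", "disease"],
    "coronary thrombosis": ["myocardial infarction"],
    "disease": ["heart disease", "kidney disease"],
    "kidney disease": ["disease"],
}


def personalized_pagerank(graph, seeds, damping=0.85, iterations=100):
    """Power iteration in which the random walk restarts only at the seeds."""
    seed_list = [s for s in seeds if s in graph]
    if not seed_list:
        raise ValueError("no query term was found in the graph")
    teleport = {n: (1.0 / len(seed_list) if n in seed_list else 0.0)
                for n in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in graph}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for s in seed_list:
                    nxt[s] += damping * rank[node] / len(seed_list)
        rank = nxt
    return rank


def expand_with_ppr(graph, query_terms, k, scale=0.5) -> List[Tuple[str, float]]:
    """Append the k best-ranked non-query terms, down-weighted by `scale`."""
    rank = personalized_pagerank(graph, query_terms)
    candidates = [n for n in graph if n not in query_terms]
    candidates.sort(key=lambda n: (-rank[n], n))
    original = [(t, 1.0) for t in query_terms]
    return original + [(t, scale) for t in candidates[:k]]


if __name__ == "__main__":
    for term, weight in expand_with_ppr(TOY_GRAPH, ["myocardial infarction"], k=2):
        print(f"{term}: {weight}")
```

Down-weighting the appended terms relative to the original ones is one simple way to express the scaling idea; in a search engine such as Lucene the weights could be applied as per-term boosts, although how the dissertation applies them is not stated in the abstract.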

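Bibliographic record

  • Author: Dong, Liang
  • Author affiliation: Clemson University
  • Degree-granting institution: Clemson University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2012
  • Pages: 139 p.
  • Total pages: 139
  • Original format: PDF
  • Language of text: English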
