Journal: Data & Knowledge Engineering

The phrase-based vector space model for automatic retrieval of free-text medical documents


Abstract

Objective: To develop a document indexing scheme that improves retrieval effectiveness for free-text medical documents. Design: The phrase-based vector space model (VSM) uses multi-word phrases as indexing terms. Each phrase consists of a concept in the Unified Medical Language System (UMLS) and its corresponding component word stems. The similarity between concepts is defined by their relations in a hypernym hierarchy derived from UMLS. After defining the similarity between two phrases by their stem overlap and the similarity between the concepts they represent, we define the similarity between two documents as the cosine of the angle between their corresponding phrase vectors. This paper reports the development and validation of the phrase-based VSM. Measurement: We compare the retrieval effectiveness of different vector space models using two standard test collections, OHSUMED and Medlars. OHSUMED contains 105 queries and 14,430 documents, and Medlars contains 30 queries and 1,033 documents. Each document in the test collections is judged by human experts to be either relevant or non-relevant to each query. Retrieval effectiveness is measured by precision and recall. Results: The phrase-based VSM is significantly more effective than the current gold standard, the stem-based VSM. These significant improvements in retrieval effectiveness are observed in both exhaustive-search and cluster-based document retrieval. Conclusion: The phrase-based VSM is a better indexing scheme than the stem-based VSM. Medical document retrieval using the phrase-based VSM is significantly more effective than that using the stem-based VSM.
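
The abstract does not give the exact formulas, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes a phrase is a pair (concept id, stems), that stem overlap is measured as a Jaccard ratio, that phrase similarity is the larger of stem overlap and concept similarity, and that document similarity is a cosine generalized by those phrase similarities. The concept ids, the `concept_sim` table, and all function names are hypothetical and are not taken from the paper or from UMLS. A small precision and recall helper is included because those are the measures named in the abstract.

```python
import math


def stem_overlap(stems_a, stems_b):
    """Jaccard overlap of two stem collections (an assumed overlap measure)."""
    a, b = set(stems_a), set(stems_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def phrase_similarity(p, q, concept_sim):
    """Similarity of two phrases, each a (concept_id, stems) tuple.

    Assumption: take the larger of concept similarity and stem overlap.
    concept_sim maps frozenset({id1, id2}) to a value in [0, 1]; it stands in
    for a similarity derived from a hypernym hierarchy.
    """
    if p[0] == q[0]:
        c = 1.0
    else:
        c = concept_sim.get(frozenset((p[0], q[0])), 0.0)
    return max(c, stem_overlap(p[1], q[1]))


def document_similarity(x, y, concept_sim):
    """Cosine-style similarity of two phrase vectors (dict: phrase -> weight)."""
    def inner(u, v):
        return sum(
            wu * wv * phrase_similarity(p, q, concept_sim)
            for p, wu in u.items()
            for q, wv in v.items()
        )

    denom = math.sqrt(inner(x, x) * inner(y, y))
    return inner(x, y) / denom if denom else 0.0


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against expert judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical concept ids and similarity value, for illustration only.
concept_sim = {frozenset(("C1", "C2")): 0.5}
doc_a = {("C1", ("hypertherm",)): 1.0}
doc_b = {("C2", ("fever",)): 1.0}
score = document_similarity(doc_a, doc_b, concept_sim)
```

In this toy case the two documents share no stems, so a purely stem-based cosine would score them as unrelated, while the assumed concept similarity gives them a non-zero score. That is the kind of effect the phrase-based indexing scheme is meant to capture, though the paper's actual weighting may differ.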
