Journal: The Electronic Library

Selecting a text similarity measure for a content-based recommender system: A comparison in two corpora



Abstract

Purpose: This paper develops a journal recommender system that compares the content similarity between a manuscript and the existing journal articles in two subject corpora (covering the social sciences and medicine). The study examines the appropriateness of three text similarity measures and the impact of several characteristics of corpus documents on system performance.

Design/methodology/approach: Three similarity measures were implemented, one at a time, on a journal recommender system with two separate journal corpora. Two distinct samples of test abstracts were classified and evaluated using the normalized discounted cumulative gain.

Findings: Overall, the BM25 similarity measure outperforms both the cosine and unigram language similarity measures, and the unigram language measure shows the lowest performance. The performance results differ significantly between each pair of similarity measures, while the BM25 and cosine similarity measures are moderately correlated. Cosine similarity performs better for subjects with a higher density of technical vocabulary and shorter corpus documents. Moreover, increasing the number of corpus journals in the social sciences improved performance for cosine similarity and BM25.

Originality/value: This is the first work on comparing the suitability of a number of string-based similarity measures with distinct corpora for journal recommender systems.
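
The abstract names the three similarity measures and the evaluation metric but gives no formulas or parameter settings. The sketch below shows one common textbook form of each, so that the comparison is easier to picture. It is not the authors' implementation. The whitespace tokenizer, the BM25 parameters `k1` and `b`, the reading of the "unigram language" measure as a query-likelihood model with Jelinek-Mercer smoothing (`lam`), and the smoothed TF-IDF weighting used for cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumed; real systems usually also stem and drop stop words)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document for a tokenized query (k1, b are assumed defaults)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def cosine_scores(query, docs):
    """Cosine similarity between smoothed TF-IDF vectors of the query and each document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))

    def tfidf(tokens):
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    q = tfidf(query)
    return [cosine(q, tfidf(d)) for d in docs]


def unigram_lm_scores(query, docs, lam=0.5):
    """Log query likelihood under a unigram language model with Jelinek-Mercer smoothing (assumed).

    Documents are assumed to be non-empty.
    """
    cf = Counter(t for d in docs for t in d)  # collection frequency per term
    total = sum(cf.values())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if cf[t] == 0:
                continue  # a term unseen in the whole corpus cannot discriminate between documents
            score += math.log((1 - lam) * tf[t] / len(d) + lam * cf[t] / total)
        scores.append(score)
    return scores


def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain of graded relevances listed in ranked order."""
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)[:len(ranked)]

    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    best = dcg(ideal)
    return dcg(ranked) / best if best else 0.0


def rank(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy corpus (hypothetical): each string stands in for the text of one journal's articles.
corpus = [tokenize(t) for t in (
    "health care policy reform social welfare",
    "clinical trial of cancer drug therapy",
    "cancer patients clinical outcomes therapy",
)]
manuscript = tokenize("clinical cancer therapy trial")
```

In a recommender of the kind the abstract describes, each measure would score the manuscript against every corpus document, the scores would be turned into a ranked list of journals with something like `rank`, and the graded relevance of that list would be summarized by `ndcg`.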
