Journal: Information Retrieval

#nowplaying Madonna: a large-scale evaluation on estimating similarities between music artists and between movies from microblogs



Abstract

Different term weighting techniques such as TF·IDF or BM25 have been used intensively for a wide range of text-based information retrieval tasks. Their use for modeling term profiles of named entities, and for subsequently calculating similarities between those entities, has been studied to a much smaller extent. The recent trend of microblogging has made available massive amounts of information about almost every topic around the world. Microblogs therefore represent a valuable source for text-based named entity modeling. In this paper, we present a systematic and comprehensive evaluation of different term weighting measures, normalization techniques, query schemes, index term sets, and similarity functions for the task of inferring similarities between named entities, based on data extracted from microblog posts. We analyze several thousand combinations of choices for the above-mentioned dimensions, which influence the similarity calculation process, and we investigate how they affect the quality of the similarity estimates. Evaluation is performed using three real-world data sets: two collections of microblogs related to music artists and one related to movies. For the music collections, we present results of genre classification experiments using genre information from allmusic.com as the benchmark. For the movie collection, we present results of multi-class classification experiments using categories from IMDb as the benchmark. We show that microblogs can indeed be exploited to model named entity similarity with remarkable accuracy, provided the correct settings for the analyzed aspects are used. We further compare the results to those obtained when using Web pages as the data source.
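
As a rough illustration of the general idea described in the abstract (building a weighted term profile per named entity and comparing profiles with a similarity function), the following minimal Python sketch may help. It is not the paper's pipeline: the toy token lists, the function names `tfidf_profiles` and `cosine`, and the particular choice of logarithmic TF, plain IDF, and cosine similarity are all assumptions made for illustration. The paper evaluates many such combinations, and this shows only one of them.

```python
import math
from collections import Counter


def tfidf_profiles(docs):
    """Build one TF-IDF term profile per entity.

    docs maps an entity name to the list of tokens gathered from posts
    about it (hypothetical input format, chosen for this sketch).
    """
    n = len(docs)
    # Document frequency: in how many entity profiles each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    profiles = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        # Logarithmic TF times plain IDF; one of many possible variants.
        profiles[name] = {
            term: (1.0 + math.log(count)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data standing in for tokens extracted from microblog posts (invented).
posts = {
    "Madonna": "pop dance pop queen".split(),
    "Kylie Minogue": "pop dance disco".split(),
    "Metallica": "metal thrash guitar".split(),
}

profiles = tfidf_profiles(posts)
sim_pop = cosine(profiles["Madonna"], profiles["Kylie Minogue"])
sim_cross = cosine(profiles["Madonna"], profiles["Metallica"])
print(sim_pop, sim_cross)
```

In this toy setup the two pop artists share terms and therefore receive a higher similarity than the pop/metal pair, which shares none. Swapping in a different weighting (for example BM25) or a different similarity function only requires replacing the corresponding function, which is the kind of variation the evaluation in the abstract explores.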
