Conference: International Conference on Text, Speech and Dialogue

SuMACC Project's Corpus: A Topic-Based Query Extension Approach to Retrieve Multimedia Documents

Abstract

The SuMACC project aims to automatically track new multimodal entities on the Internet. Its goal is to propose robust multimedia methods that define relevant patterns for retrieving these entities automatically. This paper describes the SuMACC corpus, collected on video-sharing platforms using word queries. Since concepts are limited to a single word or a few words, querying video-sharing platforms with the concept alone can easily return irrelevant videos. In this paper, we propose to use an extended query obtained by mapping the initial concept into a topic space built with a Latent Dirichlet Allocation (LDA) algorithm. This topic-based query extension approach makes it possible to better retrieve videos related to the targeted concept. As a result, a corpus of 7,517 videos was obtained from 47 concepts, using both the simple (i.e. concept-only) and the extended queries. Results show the effectiveness of the proposed thematic querying approach compared to the simple concept query in terms of relevance (+21%) and ambiguity (-4%). The annotation process and the corpus statistics are detailed in this paper.
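
The idea of extending a concept query through a topic space can be illustrated with a minimal sketch. It is not the authors' implementation: the topic-word table `TOPICS`, the function names, and the rule of picking the single topic that gives the concept the highest probability are all assumptions made for illustration, standing in for the output of a trained LDA model.

```python
from typing import Dict, List

# Hypothetical topic-word probabilities, standing in for a trained LDA model.
TOPICS: Dict[str, Dict[str, float]] = {
    "topic_0": {"jaguar": 0.30, "car": 0.25, "engine": 0.20, "speed": 0.15, "road": 0.10},
    "topic_1": {"jaguar": 0.10, "animal": 0.35, "jungle": 0.30, "cat": 0.15, "wild": 0.10},
}


def closest_topic(concept_words: List[str], topics: Dict[str, Dict[str, float]]) -> str:
    """Return the id of the topic that assigns the concept words the highest total probability."""
    def score(topic_id: str) -> float:
        dist = topics[topic_id]
        return sum(dist.get(word, 0.0) for word in concept_words)

    return max(topics, key=score)


def extend_query(concept: str, topics: Dict[str, Dict[str, float]], n_extra: int = 3) -> List[str]:
    """Append the n_extra most probable words of the closest topic to the concept words."""
    words = concept.lower().split()
    topic_id = closest_topic(words, topics)
    # Rank the topic's vocabulary by decreasing probability.
    ranked = sorted(topics[topic_id].items(), key=lambda kv: kv[1], reverse=True)
    # Skip words that are already part of the original concept.
    extra = [word for word, _ in ranked if word not in words][:n_extra]
    return words + extra


if __name__ == "__main__":
    print(" ".join(extend_query("Jaguar", TOPICS, n_extra=2)))
```

With this toy table, the one-word concept "Jaguar" is mapped to the car-related topic and the query gains two context words, which is the kind of disambiguation the extended query is meant to provide.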
