Annual Meeting of the Association for Computational Linguistics

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video



Abstract

In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, is extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor, which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train the proposed attentive interactor, strengthening the matching behaviors of reliable instance-sentence pairs and penalizing the unreliable ones. Moreover, we also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches.
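The abstract describes training the attentive interactor with a ranking loss plus a diversity loss over instance-sentence matching scores. The following is a minimal PyTorch sketch of one plausible formulation of these two terms, assuming an (N, N) matrix of aggregated video-sentence scores and an (N, K) matrix of per-instance scores; the function names, the margin value, and the entropy-based form of the diversity loss are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def ranking_loss(scores, margin=0.2):
    """Triplet-style ranking loss over a batch of video-sentence matching scores.

    `scores` is an (N, N) matrix where scores[i, j] is the aggregated matching
    score between video i's instances and sentence j; the diagonal holds the
    positive (paired) scores. This is a generic hinge formulation, not
    necessarily the exact one used in the paper.
    """
    pos = scores.diag().view(-1, 1)                       # (N, 1) paired scores
    cost_s = F.relu(margin + scores - pos)                # rank over sentences
    cost_v = F.relu(margin + scores - pos.t())            # rank over videos
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0.0)                # ignore the positives
    cost_v = cost_v.masked_fill(mask, 0.0)
    return cost_s.mean() + cost_v.mean()

def diversity_loss(instance_scores):
    """Sharpen the score distribution over a video's candidate instances.

    `instance_scores` is an (N, K) matrix of matching scores between the K
    candidate spatio-temporal tubes of each video and its paired sentence.
    Minimizing the entropy of the softmax-normalized scores strengthens
    reliable instance-sentence pairs and suppresses unreliable ones (one
    plausible reading of the paper's diversity loss; details may differ).
    """
    probs = F.softmax(instance_scores, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)
    return entropy.mean()
```

In training, the total objective would combine the two terms, e.g. `ranking_loss(scores) + lam * diversity_loss(instance_scores)`, where the weight `lam` is a hypothetical hyperparameter tuned on validation data.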
