Computational Intelligence and Neuroscience

Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning



Abstract

Recognition of human activities is an essential field in computer vision. Most human activities consist of interactions between humans and objects. In recent years, much successful work has been done on human-object interaction (HOI) recognition, achieving acceptable results. However, these methods are fully supervised and require labeled training data for all HOIs. Because the space of possible human-object interactions is enormous, listing and providing training data for every category is costly and impractical. To solve this problem, we propose an approach that scales human-object interaction recognition in video data through zero-shot learning. Our method recognizes a verb and an object from the video and composes them into an HOI class. Recognizing verbs and objects instead of whole HOIs makes it possible to identify new combinations of verbs and objects, so HOI classes never seen by the recognizer system can still be identified. We introduce a neural network architecture that can understand and represent video data. The proposed system learns verbs and objects from the available training data in the training phase and identifies verb-object pairs in a video at test time. The system can therefore identify HOI classes formed from different combinations of objects and verbs. We also propose using lateral information, derived from word-embedding techniques, to combine verbs and objects into valid verb-object pairs; this helps prevent the detection of rare and probably incorrect HOIs. Furthermore, we propose a new feature aggregation method that aggregates the high-level features extracted from video frames before feeding them to the classifier. We show that this aggregation method is more effective for actions that comprise multiple subactions. We evaluated our system on the recently introduced and challenging Charades dataset, which contains many HOI categories in videos.
We show that, in addition to acceptable recognition of seen classes, the proposed system can detect unseen HOI classes. The number of classes identifiable by the system is therefore greater than the number of classes used for training.
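The zero-shot composition described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classifier scores, the toy embedding vectors, and the similarity threshold are all invented for the example (a real system would use pretrained embeddings such as word2vec or GloVe, which the abstract's "lateral information" refers to).

```python
# Hedged sketch of zero-shot HOI composition: verb and object classifiers
# are scored independently, then paired; word-embedding similarity filters
# out implausible verb-object pairs. All names, scores, and vectors below
# are illustrative assumptions, not values from the paper.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy word-embedding vectors standing in for pretrained ones.
embed = {
    "hold":  np.array([0.9, 0.1, 0.0]),
    "drink": np.array([0.8, 0.3, 0.1]),
    "cup":   np.array([0.7, 0.4, 0.0]),
    "broom": np.array([0.1, 0.2, 0.9]),
}

# Hypothetical classifier outputs for one video clip.
verb_scores = {"hold": 0.6, "drink": 0.4}
object_scores = {"cup": 0.7, "broom": 0.3}

def compose_hoi(verb_scores, object_scores, embed, sim_threshold=0.5):
    """Rank verb-object pairs by joint score, keeping only pairs whose
    embedding similarity suggests a plausible interaction."""
    pairs = []
    for verb, pv in verb_scores.items():
        for obj, po in object_scores.items():
            if cosine(embed[verb], embed[obj]) >= sim_threshold:
                pairs.append((verb, obj, pv * po))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

ranked = compose_hoi(verb_scores, object_scores, embed)
# "hold cup" and "drink cup" survive; "hold broom" and "drink broom"
# are filtered out by the embedding check even though both classifiers
# assign them nonzero probability.
```

Because verbs and objects are scored separately, a pair such as "drink cup" can be ranked even if that exact HOI class never appeared in training, which is the source of the zero-shot capability; the embedding filter is what suppresses rare and probably wrong combinations.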
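The abstract does not specify the proposed feature aggregation method, so the sketch below shows one plausible scheme consistent with its motivation: pooling per-frame features within temporal segments rather than over the whole clip, so that distinct subactions are not averaged away. The segment count and shapes are assumptions for illustration.

```python
# Hedged sketch of segment-wise temporal aggregation (an assumption, not
# the paper's method): per-frame high-level features are mean-pooled
# within each temporal segment, and the segment vectors are concatenated,
# preserving the order of subactions for the downstream classifier.
import numpy as np

def aggregate_features(frame_feats: np.ndarray, n_segments: int = 3) -> np.ndarray:
    """frame_feats: (T, D) array of high-level features for T frames.
    Returns a (n_segments * D,) vector: one mean-pooled block per segment."""
    segments = np.array_split(frame_feats, n_segments, axis=0)
    return np.concatenate([seg.mean(axis=0) for seg in segments])

# Toy input: 12 frames of 4-dimensional features.
T, D = 12, 4
feats = np.arange(T * D, dtype=float).reshape(T, D)
agg = aggregate_features(feats, n_segments=3)  # shape (3 * 4,) = (12,)
```

Compared with global average pooling, which collapses all T frames into one D-dimensional vector, this keeps a separate summary per segment, which matches the abstract's claim that the method helps most for actions composed of multiple subactions.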
