Journal: Multimedia Tools and Applications

Naming multi-modal clusters to identify persons in TV broadcast


Abstract

Person identification in TV broadcasts is one of the main tools for indexing this type of video. The classical approach uses biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, in order both to name clusters and to prevent the fusion of two clusters that carry different names. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix under the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2% for speaker identification and 60.2% for face identification. Adding a few biometric models improves these results to 82.4% and 65.6% for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models, trained on matching development data and additional TV and radio data, provides a 67.8% F-measure, while 908 face models provide only a 30.5% F-measure.
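
The name-constrained clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the linkage rule, the stopping criterion, or how the multi-modal distances are computed, so the average linkage, the `threshold` parameter, and the function name `constrained_agglomerative` below are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np


def constrained_agglomerative(dist, names, threshold):
    """Average-linkage clustering that never merges differently named clusters.

    dist:      symmetric (n, n) array of distances between items
               (speaker turns and face tracks); hypothetical input.
    names:     list of length n, a written name or None for each item.
    threshold: assumed stopping criterion; merging stops once the closest
               admissible pair of clusters is farther apart than this.
    Returns a list of (member_indices, name) tuples.
    """
    dist = np.asarray(dist, dtype=float)
    # Start with one cluster per item, carrying that item's name (if any).
    clusters = [([i], names[i]) for i in range(len(names))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, na), (mb, nb) = clusters[a], clusters[b]
                # Constraint: two clusters with different names cannot merge.
                if na is not None and nb is not None and na != nb:
                    continue
                # Average linkage (an assumption): mean pairwise distance.
                d = dist[np.ix_(ma, mb)].mean()
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best
        (ma, na), (mb, nb) = clusters[a], clusters[b]
        # The merged cluster inherits whichever name is available.
        merged = (ma + mb, na if na is not None else nb)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
```

In this sketch an unnamed cluster takes the name of the named cluster it merges with, which is one simple way to propagate written names onto clusters; the constraint alone is what keeps two named persons apart even when their turns or tracks lie close together.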
