Naming multi-modal clusters to identify persons in TV broadcast

Poignant Johann; Fortier Guillaume; Besacier Laurent; Quenot Georges


Abstract

Identifying persons in TV broadcasts is one of the main tools for indexing this type of video. The classical approach uses biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, in order both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix, with the constraint that clusters associated with different names are not merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2% for speaker identification and 60.2% for face identification. Adding a few biometric models improves the results, leading to 82.4% and 65.6% for speaker and face identification respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data provides an F-measure of 67.8%, while 908 face models provide only 30.5%.
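The name-constrained clustering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `constrained_agglomerative`, the use of average linkage, the distance threshold as a stopping rule, and the toy distance values are all assumptions made for illustration. The sketch only shows the core idea that two clusters carrying different written names are never merged, even when they are the closest pair.

```python
from itertools import combinations


def constrained_agglomerative(dist, names, threshold):
    """Agglomerative clustering that refuses to merge differently named clusters.

    dist      -- symmetric distance matrix (list of lists) over all items
                 (speaker turns and face tracks mixed together)
    names     -- one entry per item: a written name, or None if unnamed
    threshold -- hypothetical stopping criterion: no merge above this distance
    """
    clusters = [{"members": [i], "name": names[i]} for i in range(len(names))]

    def linkage(a, b):
        # Average linkage (an assumption; the abstract does not specify it).
        pairs = [(i, j) for i in a["members"] for j in b["members"]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    def compatible(a, b):
        # Merging is allowed unless both clusters are named and the names differ.
        return a["name"] is None or b["name"] is None or a["name"] == b["name"]

    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            if not compatible(clusters[a], clusters[b]):
                continue
            d = linkage(clusters[a], clusters[b])
            if d <= threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            break  # no admissible pair left
        _, a, b = best
        merged = {
            "members": sorted(clusters[a]["members"] + clusters[b]["members"]),
            # The merged cluster inherits whichever name is available.
            "name": clusters[a]["name"] or clusters[b]["name"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


# Toy data: items 0 and 2 are speaker turns with written names,
# items 1 and 3 are unnamed face tracks.
toy_names = ["Alice", None, "Bob", None]
toy_dist = [
    [0.0, 0.2, 0.1, 0.9],
    [0.2, 0.0, 0.8, 0.9],
    [0.1, 0.8, 0.0, 0.3],
    [0.9, 0.9, 0.3, 0.0],
]
result = constrained_agglomerative(toy_dist, toy_names, threshold=0.5)
for cluster in result:
    print(cluster["name"], cluster["members"])
```

In the toy data, items 0 and 2 are the closest pair (distance 0.1), so unconstrained clustering would merge them first. Because they carry different names, the constraint blocks that merge and each unnamed face track instead joins the named speaker turn it is closest to.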


Bibliographic record

Source
Multimedia Tools and Applications, 2016, issue 15, pp. 8999-9023 (25 pages)
Authors
Poignant Johann; Fortier Guillaume; Besacier Laurent; Quenot Georges
Affiliations (the same entry is listed for each of the four authors)
Univ Grenoble Alpes, LIG, F-38000 Grenoble, France; CNRS, LIG, F-38000 Grenoble, France
Original format
PDF
Language
English
Keywords
Multimodal fusion; VideoOCR; Face and speaker identification; TV broadcast

